Fine-tuned Qwen 2.5 14B Instruct for Scientific Named Entity Recognition (NER)

This repository hosts a fine-tuned version of the Qwen/Qwen2.5-14B-Instruct model, specifically adapted for Named Entity Recognition (NER) in scientific texts. The model has been instruction-tuned to extract specific entity types, outputting them in a structured JSON format.

Model Description

This model is a specialized version of Qwen 2.5 14B Instruct, designed to identify and extract key information from scientific documents. It leverages an instruction-following approach, where the model is prompted with a detailed task description (including the target entity types and output format) and the text to be annotated. Its primary application is to automate the extraction of structured data from research papers, abstracts, or other scientific literature.
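As a rough illustration of the instruction-following setup described above, the sketch below turns one annotated record into a system/user/assistant message list. The record layout, the `to_chat_example` helper, and the shortened user template are assumptions made for illustration; the actual schema of the training files is not documented in this card.

```python
import json

# Hypothetical annotated record; the real training schema may differ.
record = {
    "text": "Mangrove cover was surveyed in Fiji from 2019 to 2021.",
    "entities": {
        "Mangrove cover": "Focalpoint",
        "Fiji": "Locationofstudy",
        "from 2019 to 2021": "Timeperiodofstudy",
    },
}


def to_chat_example(record, system_prompt, user_template):
    """Turn one annotated record into a system/user/assistant message list."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_template.format(text=record["text"])},
        # The target response is the flat JSON dictionary the model should emit.
        {"role": "assistant", "content": json.dumps(record["entities"], ensure_ascii=False)},
    ]


messages = to_chat_example(
    record,
    system_prompt="You are an expert in scientific reasoning and information extraction.",
    # Shortened stand-in for the full task prompt (assumption).
    user_template="### Input\n{text}\n\n### Output\n",
)
print(json.dumps(messages, indent=2, ensure_ascii=False))
```

The assistant turn holds only the JSON dictionary, which mirrors the "JSON only, no commentary" behaviour the model is expected to reproduce at inference time.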

Fine-tuning Details

Base Model: Qwen/Qwen2.5-14B-Instruct
Task: Named Entity Recognition (NER)
Domain: Scientific Literature
Fine-tuning Method: QLoRA, i.e. LoRA (Low-Rank Adaptation) adapters trained on top of a 4-bit quantized base model, using the PEFT library.
Training Data: The model was fine-tuned on a custom dataset (ner_train.json and potentially ner_train_1.json), which consists of scientific text annotated with specific NER labels. The data was transformed into an instruction-response format suitable for instruction-tuning large language models.
Key LoRA Parameters: r=64, lora_alpha=16, lora_dropout=0.1.
Training Hyperparameters:
- per_device_train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 2e-4
- max_steps: 500 (adjust if you used a different number)
- optim: paged_adamw_8bit
- fp16: True
- gradient_checkpointing: True
Hardware: Training was performed on a GPU (e.g., NVIDIA H100).
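To make the numbers above easier to interpret, here is a small worked calculation of the LoRA scaling factor and the effective batch size they imply. The `num_gpus = 1` value is an assumption (the card does not state the device count), and the scaling formula is the standard LoRA `alpha / r` rule rather than anything specific to this training run.

```python
# Values copied from the lists above.
lora_r, lora_alpha = 64, 16
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
max_steps = 500
num_gpus = 1  # assumption: single-GPU training

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

# Sequences contributing to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Upper bound on training sequences processed over the whole run.
sequences_seen = effective_batch * max_steps

print(f"LoRA scaling (alpha / r): {lora_scaling}")
print(f"Effective batch size:     {effective_batch}")
print(f"Sequences seen in total:  {sequences_seen}")
```

Under these assumptions the adapter update is scaled by 0.25, each optimizer step covers 32 sequences, and the run processes at most 16,000 sequences.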

Named Entity Types

The model is trained to recognize and extract the following eight scientific entity types:

Ecosystem: Refers to the type of natural or artificial environment, land use, or specific study site characteristics beyond just coordinates/city.
Focalpoint: Refers to the main species, organism, or subject of study.
Locationofstudy: Refers to information about the physical setting, coordinates, geographical location, and name of country/city.
Mainhypothesisandcorrespondingresults: Refers to the primary hypothesis tested in the study and its direct, corresponding findings.
Method: Refers to method/technique/instrument used in the study.
Reccomendationsandsuggestions: Refers to proposals for future actions, research, or applications based on the study's findings.
Researchquestions: Refers to the specific problems, gaps, or questions the research aims to address.
Timeperiodofstudy: Refers to information about the timing of the study, including beginning and end date, total duration, timing, and duration of fieldwork.
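Because the model is expected to return a flat JSON dictionary that maps verbatim text spans to one of these eight labels, it can be useful to check an output before trusting it. The sketch below is one possible check; `validate_entities` and the sample text are hypothetical and not part of this repository.

```python
NER_LABELS = {
    "Ecosystem", "Focalpoint", "Locationofstudy",
    "Mainhypothesisandcorrespondingresults", "Method",
    "Reccomendationsandsuggestions",  # spelling kept exactly as used in training
    "Researchquestions", "Timeperiodofstudy",
}


def validate_entities(text, entities):
    """Split an output dict into valid pairs and (phrase, reason) rejections."""
    valid, rejected = {}, []
    for phrase, label in entities.items():
        if not isinstance(label, str) or label not in NER_LABELS:
            rejected.append((phrase, "unknown label"))
        elif phrase not in text:
            rejected.append((phrase, "not a verbatim substring"))
        else:
            valid[phrase] = label
    return valid, rejected


text = "Seagrass meadows were sampled in Aotearoa New Zealand."
output = {
    "Seagrass meadows": "Ecosystem",
    "New Zealand (Aotearoa)": "Locationofstudy",  # paraphrased, so rejected
    "sampled": "Technique",                       # label outside the eight types
}
valid, rejected = validate_entities(text, output)
print(valid)
print(rejected)
```

Note that the label `Reccomendationsandsuggestions` keeps its original spelling here, since a corrected spelling would not match what the model was trained to emit.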

How to Use

You can load and use this model with the Hugging Face transformers and peft libraries. Loading the base model in 4-bit additionally requires bitsandbytes.

import json  # needed by extract_entities() and by the example output below

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# --- Configuration ---
# Hugging Face repository that hosts the LoRA adapter and tokenizer
MODEL_REPO_ID = "MikeACedric/finetuned-qwen-for-ner-v2"
GENERATION_MAX_NEW_TOKENS = 1024 # Max tokens for the JSON output

# Define the NER labels used during training
NER_LABELS = [
    "Ecosystem", "Focalpoint", "Locationofstudy", "Mainhypothesisandcorrespondingresults",
    "Method", "Reccomendationsandsuggestions", "Researchquestions", "Timeperiodofstudy"
]

# --- Prompts (must match training prompts exactly) ---
SYSTEM_PROMPT = "You are an expert in scientific reasoning and information extraction."
USER_CONTEXT_TEMPLATE = """## Task Description
Perform step-by-step reasoning to identify Named Entities in the scientific text.

Each JSON key must be a single, exact substring from the input text. Each JSON value must be exactly one of these eight labels (no spelling variants):

1. "Ecosystem": Refers to the type of natural or artificial environment, land use, or specific study site characteristics beyond just coordinates/city.
2. "Focalpoint": Refers to the main species, organism, or subject of study.
3. "Locationofstudy": Refers to information about the physical setting, coordinates, geographical location, and name of country/city.
4. "Mainhypothesisandcorrespondingresults": Refers to the primary hypothesis tested in the study and its direct, corresponding findings.
5. "Method": Refers to method/technique/instrument used in the study.
6. "Reccomendationsandsuggestions": Refers to proposals for future actions, research, or applications based on the study's findings.
7. "Researchquestions": Refers to the specific problems, gaps, or questions the research aims to address.
8. "Timeperiodofstudy": Refers to information about the timing of the study, including beginning and end date, total duration, timing, and duration of fieldwork. Usually found in the Abstract/Introduction/Methods sections.

- If the text contains multiple distinct phrases all belonging to the same label (for example, four different entities under “Focalpoint”), you must emit each phrase as its own JSON key.
- Never group more than one phrase under a single key.
- The output must be a single, flat JSON dictionary. No lists or nested objects.
- Do not output any extra text, no commentary, no markdown—just the JSON.
- Ensure all keys are exact, verbatim substrings from the input text. Do not paraphrase or alter the text.

### Input
{text}

### Output
Produce ONLY the JSON dictionary. Do NOT include any other text, explanations, or markdown. Start the JSON directly and end it immediately after the final brace.

A single JSON dictionary mapping each exact entity phrase to its correct label:
{{
"""

# --- Load Model and Tokenizer ---
print(f"Loading tokenizer from {MODEL_REPO_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

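print(f"Loading base model, then the LoRA adapter from {MODEL_REPO_ID}...")
# Load the base model (Qwen2.5-14B-Instruct) in 4-bit, matching the QLoRA setup.
# Recent transformers releases expect quantization options in a BitsAndBytesConfig
# rather than a bare load_in_4bit=True keyword argument.
from transformers import BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)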

# Load the PEFT adapter
model = PeftModel.from_pretrained(base_model, MODEL_REPO_ID)
model = model.merge_and_unload() # Merge LoRA weights into the base model for inference
model.eval() # Set model to evaluation mode

print("Model loaded and merged successfully!")

# --- Inference Function ---
@torch.no_grad()
def extract_entities(text: str) -> dict:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_CONTEXT_TEMPLATE.format(text=text)}
    ]
    
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(
        input_ids,
        max_new_tokens=GENERATION_MAX_NEW_TOKENS,
        pad_token_id=tokenizer.eos_token_id,
        attention_mask=torch.ones_like(input_ids)
    )

    response_ids = generated_ids[0, input_ids.shape[1]:]
    model_output = tokenizer.decode(response_ids, skip_special_tokens=True)
    
    # Basic JSON extraction (you might want a more robust parser)
    try:
        start_brace = model_output.find('{')
        end_brace = model_output.rfind('}')
        if start_brace != -1 and end_brace != -1 and start_brace < end_brace:
            json_str = model_output[start_brace : end_brace + 1]
            # Simple fix for trailing commas if any
            json_str = json_str.replace(', }', '}')
            json_str = json_str.replace(',]', ']')
            return json.loads(json_str)
        else:
            print(f"Warning: No valid JSON found in model output: {model_output}")
            return {}
    except Exception as e:
        print(f"Error parsing JSON from model output: {e}")
        print(f"Raw output: {model_output}")
        return {}


# --- Example Usage ---
example_text_1 = "Blue carbon habitats in Aotearoa New Zealand—opportunities for conservation, restoration, and carbon sequestration. This study was conducted from January 2023 to June 2024."

print("\n--- Example 1 Inference ---")
extracted_json_1 = extract_entities(example_text_1)
print("Extracted Entities (JSON):")
import json
print(json.dumps(extracted_json_1, indent=2, ensure_ascii=False))

example_text_2 = "The primary research question was to investigate the efficacy of CRISPR-Cas9 genome editing in maize for drought resistance. Our hypothesis was that edited plants would show improved water retention."

print("\n--- Example 2 Inference ---")
extracted_json_2 = extract_entities(example_text_2)
print("Extracted Entities (JSON):")
print(json.dumps(extracted_json_2, indent=2, ensure_ascii=False))
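The basic JSON extraction in `extract_entities` above notes that a more robust parser may be desirable. The sketch below shows one possible approach using only the standard library: it strips trailing commas with a regular expression and then tries `json.JSONDecoder.raw_decode` at each opening brace. The helper names `parse_json_dict` and `group_by_label` are hypothetical and not part of this repository.

```python
import json
import re


def parse_json_dict(model_output):
    """Return the first JSON object found in model_output, or {} if none parses."""
    decoder = json.JSONDecoder()
    # Drop commas that directly precede a closing brace or bracket.
    # Caveat: this also alters such a sequence if it occurs inside a quoted string.
    cleaned = re.sub(r",\s*([}\]])", r"\1", model_output)
    for match in re.finditer(r"\{", cleaned):
        try:
            obj, _ = decoder.raw_decode(cleaned, match.start())
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    return {}


def group_by_label(entities):
    """Invert {phrase: label} into {label: [phrases]} for easier downstream use."""
    grouped = {}
    for phrase, label in entities.items():
        grouped.setdefault(label, []).append(phrase)
    return grouped


raw = 'Sure! {"maize": "Focalpoint", "CRISPR-Cas9 genome editing": "Method", }\nDone.'
entities = parse_json_dict(raw)
print(entities)
print(group_by_label(entities))
```

Unlike slicing between the first `{` and the last `}`, `raw_decode` stops at the end of the first complete object, so text that the model appends after the JSON does not break parsing.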
