---
language:
- en
metrics:
- accuracy
- bertscore
- f1
- recall
- precision
base_model:
- unsloth/mistral-7b-instruct-v0.2-bnb-4bit
library_name: transformers
tags:
- text-generation-inference
- text-generation
- unsloth
- mistral
- trl
- sft
---
Mistral 7B Instruct
This model is a fine-tuned version of unsloth/mistral-7b-instruct-v0.2-bnb-4bit, trained on the EngSaf dataset for Automatic Essay Grading.
Given a question, a reference answer, a student answer, and a mark scheme, it produces a score together with a rationale that justifies the score.
It achieves the following results on the evaluation set (a sketch of how comparable metrics could be reproduced follows the list):
- Loss: 1.1961
- Score Precision: 0.5952
- Score Recall: 0.5519
- Score F1: 0.5434
- Score Accuracy: 0.5521
- Rationale Precision: 0.6438
- Rationale Recall: 0.6315
- Rationale F1: 0.6351
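The exact evaluation script is not part of this card. The snippet below is a minimal, hedged sketch of how comparable metrics could be computed, assuming scikit-learn for the score metrics (macro-averaged) and the bert-score package for the rationale metrics; the prediction and reference lists are placeholders you would build by running the model over the evaluation set and parsing its JSON output.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from bert_score import score as bertscore

# Placeholder predictions/references (assumptions for illustration only).
pred_scores = [4, 5, 2]
gold_scores = [5, 5, 2]
pred_rationales = ["Mentions sunlight and oxygen but omits water."]
gold_rationales = ["Covers sunlight and oxygen; water is missing."]

# Score metrics: exact-match accuracy plus macro-averaged precision/recall/F1.
score_accuracy = accuracy_score(gold_scores, pred_scores)
score_p, score_r, score_f1, _ = precision_recall_fscore_support(
    gold_scores, pred_scores, average="macro", zero_division=0
)

# Rationale metrics: BERTScore returns per-example precision/recall/F1 tensors.
r_p, r_r, r_f1 = bertscore(pred_rationales, gold_rationales, lang="en")
print(score_accuracy, score_p, score_r, score_f1,
      r_p.mean().item(), r_r.mean().item(), r_f1.mean().item())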
Model Details
- Base Model: Mistral 7B (https://arxiv.org/abs/2310.06825)
- Fine-tuning Dataset: EngSaf (https://arxiv.org/abs/2407.12818)
- Task: Automatic Essay Grading
Training Data
The model is fine-tuned on the EngSaf dataset, which is curated for Automatic Essay Grading. EngSaf consists of student responses annotated with the following fields (a sketch of how one such record could be turned into a training example follows the list):
- Questions: Typically short-answer or essay-type.
- Correct Answer: Reference answers provided by teachers.
- Student Answers: Actual responses written by students.
- Output Label: The score awarded to the student answer.
- Feedback: Explanations justifying the given scores.
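The exact prompt template used during fine-tuning is not published in this card. The following is a hedged sketch of how an EngSaf-style record (the field names are assumptions) could be arranged into a chat-format example whose assistant turn carries the score and feedback as the training target.

import json

# Hypothetical EngSaf-style record; the real field names may differ.
record = {
    "question": "What is photosynthesis?",
    "reference_answer": "Photosynthesis is the process by which green plants use sunlight to synthesize nutrients from carbon dioxide and water.",
    "student_answer": "Photosynthesis is how plants make food using sunlight.",
    "mark_scheme": {"1": "Mentions use of sunlight", "2": "Mentions carbon dioxide and water"},
    "score": 1,
    "feedback": "Mentions sunlight but omits carbon dioxide and water.",
}

def to_chat_example(rec):
    # Assemble the grading prompt and pair it with the target JSON answer.
    prompt = (
        "Provide both a score and a rationale by evaluating the student's answer "
        "strictly within the mark scheme range.\n"
        f"Question: {rec['question']}\n"
        f"Reference Answer: {rec['reference_answer']}\n"
        f"Student Answer: {rec['student_answer']}\n"
        f"Mark Scheme: {rec['mark_scheme']}"
    )
    target = json.dumps({"score": rec["score"], "rationale": rec["feedback"]})
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": target},
    ]}

print(to_chat_example(record))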
Example Usage
Below is an example of how to load the model with Unsloth and run it for automatic grading:
from unsloth import FastLanguageModel

# Load the fine-tuned model and its tokenizer in 4-bit precision.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="amjad-awad/mistral-7b-instruct-v0.2-bnb-4bit-EngSaf-96K",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Switch Unsloth into its faster inference mode (no new LoRA adapters are
# needed at inference time).
FastLanguageModel.for_inference(model)
user_content = (
"Provide both a score and a rationale by evaluating the student's answer strictly within the mark scheme range, "
"grading based on how well it meets the question's requirements by comparing the student answer to the reference answer.\n"
"Question: What is photosynthesis?\n"
"Reference Answer: Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize nutrients from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a by-product.\n"
"Student Answer: Photosynthesis is how plants make their food using sunlight and carbon dioxide. It also gives off oxygen.\n"
"Mark Scheme: {'1':'Mentions use of sunlight', '2':'Mentions carbon dioxide and water', '3':'Mentions production of oxygen', '4':'Explains synthesis of nutrients or food', '5':'Mentions chlorophyll or green pigment'}"
)
user = [
{"role":"system", "content": "You are a grading assistant. Evaluate student answers based on the mark scheme. Respond only in JSON format with keys 'score' (int) and 'rationale' (string)."},
{"role":"user", "content": user_content},
]
inputs = tokenizer.apply_chat_template(
    user,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Greedy decoding keeps the grading deterministic; switch to do_sample=True
# (with temperature/top_k) if you want more varied rationales.
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0]
# Keep only the newly generated tokens and decode them.
new_generated_ids = generated_ids[inputs["input_ids"].shape[1]:]
generated_text = tokenizer.decode(new_generated_ids, skip_special_tokens=True)
print(generated_text)
Example output:
{"score": 5, "rationale": "Your answer is correct. You have accurately described the process of photosynthesis, mentioning the use of sunlight, carbon dioxide, and water, and the production of food and oxygen as by-products. Keep up the good work!"}
Training hyperparameters
The following hyperparameters were used during training; a sketch of the corresponding trainer configuration follows the list:
- per_device_train_batch_size: 1
- per_device_eval_batch_size: 1
- gradient_accumulation_steps: 8
- eval_strategy: "steps"
- save_strategy: "steps"
- eval_steps: 10
- logging_dir: "./logs"
- logging_steps: 10
- save_total_limit: 1
- learning_rate: 2e-5
- warmup_steps: 100
- weight_decay: 0.01
- num_train_epochs: 3
- load_best_model_at_end: True
- lr_scheduler_type: "cosine"
- metric_for_best_model: "eval_loss"
- greater_is_better: False
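The full training script is not included in this card. Below is a hedged sketch of how these hyperparameters could be passed to a TRL SFTTrainer; output_dir, train_dataset, and eval_dataset are placeholders, and argument names may differ slightly across TRL versions.

from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="./outputs",              # placeholder output path
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=10,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    num_train_epochs=3,
    load_best_model_at_end=True,
    lr_scheduler_type="cosine",
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                  # LoRA-wrapped base model
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # formatted EngSaf examples (placeholder)
    eval_dataset=eval_dataset,    # held-out EngSaf examples (placeholder)
    args=args,
)
trainer.train()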
Training results
Step | Training Loss | Validation Loss |
---|---|---|
10 | 3.247800 | 3.295356 |
20 | 3.224100 | 3.216746 |
30 | 3.137600 | 3.078115 |
40 | 2.919600 | 2.877193 |
50 | 2.767000 | 2.640667 |
60 | 2.488400 | 2.380044 |
70 | 2.245300 | 2.097524 |
80 | 1.993600 | 1.833924 |
90 | 1.663000 | 1.533552 |
100 | 1.460800 | 1.377964 |
110 | 1.343200 | 1.310175 |
120 | 1.307700 | 1.264394 |
130 | 1.252400 | 1.237222 |
140 | 1.221500 | 1.208290 |
150 | 1.169100 | 1.203079 |
160 | 1.120900 | 1.197736 |
170 | 1.196100 | 1.194299 |
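To visualize convergence, the loss values above can be plotted directly; the following short sketch simply restates the table with matplotlib.

import matplotlib.pyplot as plt

steps = list(range(10, 171, 10))
train_loss = [3.2478, 3.2241, 3.1376, 2.9196, 2.7670, 2.4884, 2.2453, 1.9936,
              1.6630, 1.4608, 1.3432, 1.3077, 1.2524, 1.2215, 1.1691, 1.1209,
              1.1961]
val_loss = [3.295356, 3.216746, 3.078115, 2.877193, 2.640667, 2.380044,
            2.097524, 1.833924, 1.533552, 1.377964, 1.310175, 1.264394,
            1.237222, 1.208290, 1.203079, 1.197736, 1.194299]

plt.plot(steps, train_loss, label="training loss")
plt.plot(steps, val_loss, label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()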
Framework versions
- Transformers 4.51.3
- PyTorch 2.7.0
- Datasets 3.6.0
- Unsloth 2025.5.6