Mistral-7B-Instruct-v0.2 Fine-tuned with DPO on LIMA-derived Preferences

This repository contains a version of the unsloth/mistral-7b-instruct-v0.2 model fine-tuned using Direct Preference Optimization (DPO). The fine-tuning used a preference dataset built from instructions sampled from the LIMA dataset and responses generated by the base Mistral-7B-Instruct-v0.2 model, with the responses ranked by PairRM from the LLM-Blender toolkit.

The goal of this fine-tuning is to align the model further with human preferences for helpful and harmless responses.

Model Details

  • Base Model: unsloth/mistral-7b-instruct-v0.2 (a version of mistralai/Mistral-7B-Instruct-v0.2 optimized with Unsloth)
  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Preference Data Source:
    • Instructions: Sampled from the LIMA dataset.
    • Responses: Generated by mistralai/Mistral-7B-Instruct-v0.2.
    • Ranking: Performed by llm-blender/PairRM (a minimal sketch of this ranking step follows this list).
    • The preference dataset used for DPO is available at: ssui-liu/lima-mistral7b-pairrm-preference.
  • Training Framework: Unsloth for efficient LoRA-based DPO training.
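
For illustration, the ranking step can be sketched with the llm-blender package roughly as follows. This is a minimal sketch, not the exact data-generation script used for this model; the example instruction and candidate responses are made up, and only the package and PairRM checkpoint names come from the description above.

import llm_blender

# Load the PairRM ranker used to compare candidate responses
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# One LIMA-style instruction with two candidate responses sampled from the base model
inputs = ["What are the best ways to learn a new programming language?"]
candidates = [[
    "Pick a small project, read the official docs, and write code every day.",
    "Memorize the entire language reference before writing any code.",
]]

# ranks[i][j] is the rank of candidate j for input i (1 = best)
ranks = blender.rank(inputs, candidates, return_scores=False, batch_size=1)

# Convert the ranking into a (chosen, rejected) pair for the DPO dataset
pair_ranks = list(ranks[0])
chosen = candidates[0][pair_ranks.index(min(pair_ranks))]
rejected = candidates[0][pair_ranks.index(max(pair_ranks))]
print({"prompt": inputs[0], "chosen": chosen, "rejected": rejected})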

Training Configuration

The DPO training was performed with the following key hyperparameters (see scripts/dpo_training.py for full details; a minimal training sketch follows the list):

  • LoRA r: 64
  • LoRA alpha: 64
  • LoRA dropout: 0.0
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Batch Size (per device): 2
  • Gradient Accumulation Steps: 4
  • Learning Rate: 5e-6
  • Number of Epochs: 3
  • DPO Beta (β): 0.1
  • DPO Loss Type: sigmoid
  • Max Sequence Length: 2048
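
For orientation, the sketch below shows how a DPO run with these hyperparameters can be set up using Unsloth together with TRL's DPOTrainer. It is not the actual scripts/dpo_training.py; the dataset loading, output path, and exact trainer argument names (which vary across TRL versions) are assumptions.

from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

max_seq_length = 2048

# Load the base model in 4-bit and attach LoRA adapters matching the configuration above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.2",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Preference dataset with "prompt", "chosen" and "rejected" columns
dataset = load_dataset("ssui-liu/lima-mistral7b-pairrm-preference", split="train")

training_args = DPOConfig(
    output_dir="mistral-7b-lima-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    num_train_epochs=3,
    beta=0.1,               # DPO beta
    loss_type="sigmoid",    # standard DPO (sigmoid) loss
    max_length=max_seq_length,
)

# With a PEFT model, DPOTrainer uses the adapter-disabled base model as the implicit reference
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()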

How to Use

This model is a PEFT (LoRA) adapter. To use it, you need to load the base model (unsloth/mistral-7b-instruct-v0.2) and then apply these LoRA weights.

from unsloth import FastLanguageModel
from peft import PeftModel

model_name = "unsloth/mistral-7b-instruct-v0.2"
peft_model_id = "ssui-liu/mistral-7b-lima-dpo" # Or your specific model ID
max_seq_length = 2048 # Or your preferred max sequence length

# Load the base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=None,  # Autodetect
    load_in_4bit=True, # Or False if not using 4bit
)

# Load the PEFT (LoRA) adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

FastLanguageModel.for_inference(model) # Prepare model for inference

# Example usage
prompt = "What are the best ways to learn a new programming language?"
messages = [
    {"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

outputs = model.generate(input_ids = input_ids, max_new_tokens = 256, use_cache = True)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Extract just the assistant's response
# (This might vary slightly based on the exact chat template and base model version)
response_parts = response.split("[/INST]")
if len(response_parts) > 1:
    assistant_response = response_parts[1].strip()
else:
    assistant_response = response # Fallback if pattern not found

print(assistant_response)
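
Alternatively, recent Unsloth releases can usually load a LoRA adapter repository directly with FastLanguageModel.from_pretrained, resolving the base model from the adapter config. Whether this works depends on the installed Unsloth version; a minimal sketch using the adapter repo id above:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ssui-liu/mistral-7b-lima-dpo",  # the LoRA adapter repo
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)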