Qwen2.5-Coder-1.5B-Instruct-SFT-Distilled

The Qwen2.5-Coder-1.5B-Instruct-SFT-Distilled model was distilled from the Qwen2.5-Coder-1.5B-Instruct-SFT model down to roughly 1B parameters (1.02B) using a token-based knowledge distillation method.


Table of Contents


Usage

Hugging Face

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "bunyaminergen/Qwen2.5-Coder-1.5B-Instruct-SFT-Distilled"

# Load the tokenizer and the distilled model.
tokenizer = AutoTokenizer.from_pretrained(repo, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    repo,
    device_map="auto",
    torch_dtype="auto",
).eval()

system = "You are a senior Python developer."
user = "Give me a Python implementation of bubble sort."

# Build a plain prompt and generate a completion.
text = f"System: {system}\nUser: {user}\nAssistant:"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
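
If the tokenizer bundled with this repository ships Qwen's chat template, the prompt can also be built with apply_chat_template instead of the manual format above. The following continues from the snippet above and is only a sketch; whether the template matches the SFT training format is an assumption.

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user},
]
# apply_chat_template formats the conversation and returns input ids
# (assumes the repo's tokenizer includes a chat template).
chat_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    out_ids = model.generate(chat_ids, max_new_tokens=512)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))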

Dataset


Training

Hyperparameters

  • Base Model: bunyaminergen/Qwen2.5-Coder-1.5B-Instruct-SFT
  • Knowledge Distillation Method: Token-based
  • Task Type: CAUSAL_LM
  • Number of Epochs: 11
  • Batch Size: 12
  • Gradient Accumulation Steps: 2
  • Effective Batch Size: 24 (12 × 2)
  • Learning Rate: 5e-5
  • Optimizer: AdamW
  • Precision: BF16 mixed precision
  • Evaluation Strategy: epoch
  • Max Sequence Length: 256 tokens
  • Logging: every epoch
  • Checkpoint Saving: every 10000 steps
  • Experiment Tracking: MLflow (local)
  • Experiment Name: StudentKnowledgeDistillation
  • MLflow Run Name: StudentKD
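
For reference, a minimal sketch of how these settings might map onto Hugging Face TrainingArguments. The output directory is a placeholder and the exact trainer wiring of the original run is not documented here; on older transformers versions eval_strategy is named evaluation_strategy.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-coder-1b-distilled",  # placeholder path
    num_train_epochs=11,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=2,            # effective batch size 24
    learning_rate=5e-5,
    bf16=True,                                # BF16 mixed precision
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="steps",
    save_steps=10000,
    report_to=["mlflow"],                     # local MLflow tracking
    run_name="StudentKD",
    seed=42,
)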

Knowledge Distillation Configuration

  • Distillation Weight: 0.3
  • Temperature: 0.5
  • Loss Reduction: batchmean
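
These values suggest a loss of the following form. This is a minimal sketch that assumes the distillation weight blends a temperature-scaled KL-divergence term with the standard causal-LM cross-entropy; the exact formulation used in training is not documented here.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, weight=0.3, temperature=0.5):
    # KL divergence between temperature-scaled teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Ordinary cross-entropy on the ground-truth next tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Assumed blending: `weight` on the distillation term, the remainder on cross-entropy.
    return weight * kl + (1.0 - weight) * ce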

Dataset

  • Train/Test Split: 90%/10%
  • Random Seed: 42
  • Train Batched: True
  • Eval Batched: True
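
With 🤗 Datasets, the split and batched mapping above might look like the sketch below; the dataset identifier is a placeholder and preprocess refers to the tokenization function sketched under Tokenizer Configuration.

from datasets import load_dataset

raw = load_dataset("your-dataset-id", split="train")      # placeholder identifier
splits = raw.train_test_split(test_size=0.1, seed=42)     # 90% / 10% split, seed 42
train_ds = splits["train"].map(preprocess, batched=True)  # Train Batched: True
eval_ds = splits["test"].map(preprocess, batched=True)    # Eval Batched: True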

Tokenizer Configuration

  • Truncation: Enabled (max_length=256)
  • Masked Language Modeling (MLM): False
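
A sketch of the matching tokenization function and data collator, assuming DataCollatorForLanguageModeling with mlm=False (standard causal-LM collation); the "text" column name is a placeholder.

from transformers import DataCollatorForLanguageModeling

def preprocess(batch):
    # Truncate every example to 256 tokens; "text" is a placeholder column name.
    return tokenizer(batch["text"], truncation=True, max_length=256)

# mlm=False produces causal-LM labels rather than masked-LM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)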

Speeds, Sizes, Times

  • Total Training Time: ~7 hours
  • Checkpoint Frequency: every 10000 steps
  • Checkpoint Steps:
    • checkpoint-10000
    • checkpoint-13200 (final checkpoint)

Compute Infrastructure

Hardware:

  • GPU: 1 × NVIDIA L40S (48 GB VRAM)
  • RAM: 94 GB
  • CPU: 16 vCPU

Software:

  • OS: Ubuntu 22.04
  • Framework: PyTorch 2.4.0
  • CUDA Version: 12.4.1

Licence


Links


Team


Contact


Citation

@software{Qwen2.5-Coder-1.5B-Instruct-SFT-Distilled,
  author       = {Bunyamin Ergen},
  title        = {{Qwen2.5-Coder-1.5B-Instruct-SFT-Distilled}},
  year         = {2025},
  month        = {04},
}
