---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
pipeline_tag: text-generation
tags:
- code
---
# 🧠 Fine-Tuned Qwen 2.5 Coder: Python Data Engineering Assistant

## 📌 Model Overview

This model is a fine-tuned version of **Qwen2.5-Coder-0.5B-Instruct**, adapted specifically to write clean, structured Python code for **data engineering and data transformation tasks**. It is especially effective for single-step operations such as joining datasets, assigning fiscal quarters, replacing null values, and returning structured output.

Fine-tuned by **S Prem Kaushik**, this model is optimized for precision, clean code generation, and adherence to Pythonic best practices.

---

## 🎯 Objective

The model is fine-tuned to apply the following conventions consistently in the data transformation code it generates:

- ✅ **Column Collision Handling**: Automatically applies `remove_column_collisions()` after joins.
- 📅 **Quarter & Date Handling**: Uses fiscal quarter mapping from a configurable dictionary.
- 🧼 **NaT/NaN Replacement**: Replaces `NaT` and `NaN` with Python `None`.
- 📦 **Function-Scoped Imports**: All `import` statements are inside functions.
- 📤 **Input/Output Structure**: Returns results as structured dictionaries using `.to_dict('records')`.
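
The fiscal quarter convention can be illustrated in isolation. The sketch below mirrors the `assign_quarter` logic from the sample responses in this card as a standalone function. The name `assign_fiscal_quarter` and its argument order are assumptions made for illustration; they are not part of the model or the dataset.

```python
def assign_fiscal_quarter(year, month, fiscal_config):
    """Return a label such as 'Q4 2023' for a calendar year and month.

    fiscal_config maps quarter number -> (start_month, end_month), inclusive.
    A quarter whose start month is greater than its end month wraps over
    the calendar year boundary (for example December to February).
    """
    for quarter, (start, end) in fiscal_config.items():
        if start <= end:
            if start <= month <= end:
                return f"Q{quarter} {year}"
        elif month >= start or month <= end:
            # Wrapped quarter: months after the year boundary are labelled
            # with the year in which the quarter started.
            label_year = year if month >= start else year - 1
            return f"Q{quarter} {label_year}"
    return None  # month not covered by the configuration


# Same mapping as the supplier delivery sample: Q1 March-May ... Q4 December-February.
fiscal_config = {1: (3, 5), 2: (6, 8), 3: (9, 11), 4: (12, 2)}

print(assign_fiscal_quarter(2024, 4, fiscal_config))   # Q1 2024
print(assign_fiscal_quarter(2023, 12, fiscal_config))  # Q4 2023
print(assign_fiscal_quarter(2024, 1, fiscal_config))   # Q4 2023
```

Keeping the mapping in a dictionary means a different fiscal calendar only requires a different `fiscal_config`, not a change to the function.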

---

## 🧪 Training Data

- **Format**: JSONL with `system`, `query`, and `response` entries
- **Domain**: Realistic one-step data manipulation tasks in pandas
- **Coverage**: Merging, joining, null replacement, quarter logic, grouping, etc.
- **Author**: S Prem Kaushik
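
For readers unfamiliar with the format, here is a minimal sketch of reading JSONL records that carry the `system`, `query`, and `response` keys. The helper name `parse_jsonl` and the two sample records are hypothetical; they only illustrate the shape described above and are not taken from the actual dataset.

```python
import json

REQUIRED_KEYS = ("system", "query", "response")


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, checking the expected keys."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        records.append(record)
    return records


# Hypothetical file content: one JSON object per line.
sample_text = "\n".join([
    json.dumps({"system": "Role: Python Code Generator.",
                "query": "Join orders and customers.",
                "response": "def join_orders(orders_df, customers_df): ..."}),
    json.dumps({"system": "Role: Python Code Generator.",
                "query": "Replace nulls in the sales table.",
                "response": "def clean_sales(sales_df): ..."}),
])

records = parse_jsonl(sample_text)
print(len(records), records[0]["query"])
```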

---

## 🛠 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct")
model.eval()

# Define prompt using raw role tags (no chat template)
system_prompt = "Role: Python Code Generator."
user_prompt = "Merge employee and department data, then filter by department to calculate the average salary for each employee."
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_prompt}\n<|assistant|>\n"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,  # values between 300 and 512 are a reasonable range
        repetition_penalty=1.3,  # try values between 1.2 and 1.4
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print result
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
```
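
The decoded string above contains the prompt as well as the completion, and the sample responses in this card wrap their code in a fenced `python` block. The sketch below shows one way to pull out just the code. The helper name `extract_python_block` and the sample string are hypothetical, and the fallback behaviour (returning the whole text when no fence is found) is an assumption rather than something the model card prescribes.

```python
import re

# Assembled at runtime only so that this example contains no literal code fence.
FENCE = "`" * 3

# Match an opening fence (optionally tagged "python"), then capture lazily up to
# the closing fence, or to the end of the text if generation was cut off.
_BLOCK = re.compile(
    FENCE + r"(?:python)?[ \t]*\n(.*?)(?:" + FENCE + r"|\Z)",
    re.DOTALL,
)


def extract_python_block(generated_text):
    """Return the first fenced code block, or the stripped text if none exists."""
    match = _BLOCK.search(generated_text)
    if match is None:
        return generated_text.strip()
    return match.group(1).strip()


# Hypothetical decoded output: prompt followed by a fenced completion.
decoded = (
    "<|system|>\nRole: Python Code Generator.\n"
    "<|user|>\nJoin two tables.\n<|assistant|>\n"
    + FENCE + "python\ndef join_tables(a, b):\n    return a.merge(b)\n" + FENCE
)

print(extract_python_block(decoded))
```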
---

## 📊 Sample Dataset Entries

### Example 1: Campaign ROI by Quarter with Column Collision Handling
```json
{
  "system": "Role: Python Code Generator...",
  "query": "Analyze marketing campaign ROI by fiscal quarters, handling column collisions with a left join.",
  "response": "```python\ndef campaign_roi_by_quarter(campaign_df, spending_df, fiscal_config):\n import pandas as pd\n import numpy as np\n campaign_df['launch_date'] = pd.to_datetime(campaign_df['launch_date'])\n campaign_df['year'] = campaign_df['launch_date'].dt.year\n campaign_df['month'] = campaign_df['launch_date'].dt.month\n\n def assign_quarter(row):\n  for q, (start, end) in fiscal_config.items():\n   m = row['month']\n   if start > end:\n    if m >= start or m <= end:\n     return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n   elif start <= m <= end:\n    return f\"Q{q} {row['year']}\"\n\n campaign_df['quarter'] = campaign_df.apply(assign_quarter, axis=1)\n merged_df = pd.merge(campaign_df, spending_df, on='campaign_id', how='left', suffixes=('', '_y'))\n merged_df = remove_column_collisions(merged_df, 'left')\n merged_df['roi'] = merged_df['revenue_generated'] / merged_df['spending']\n result = merged_df.groupby('quarter')['roi'].mean().reset_index()\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n return result.to_dict('records')\n\nfiscal_config = {1: (7,9), 2: (10,12), 3: (1,3), 4: (4,6)}\nresult = campaign_roi_by_quarter(source1, source2, fiscal_config)\n```"
}
```
### Example 2: Supplier Deliveries by Fiscal Quarter
```json
{
  "system": "Role: Python Code Generator...",
  "query": "Track supplier deliveries by fiscal quarters: Q1 (March-May), Q2 (June-August), Q3 (September-November), Q4 (December-February).",
  "response": "```python\ndef deliveries_by_quarter(deliveries_df, fiscal_config):\n import pandas as pd\n import numpy as np\n deliveries_df['delivery_date'] = pd.to_datetime(deliveries_df['delivery_date'])\n deliveries_df['year'] = deliveries_df['delivery_date'].dt.year\n deliveries_df['month'] = deliveries_df['delivery_date'].dt.month\n\n def assign_quarter(row):\n  for q, (start, end) in fiscal_config.items():\n   m = row['month']\n   if start > end:\n    if m >= start or m <= end:\n     return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n   elif start <= m <= end:\n    return f\"Q{q} {row['year']}\"\n\n deliveries_df['quarter'] = deliveries_df.apply(assign_quarter, axis=1)\n result = deliveries_df.groupby('quarter').size().reset_index(name='delivery_count')\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n return result.to_dict('records')\n\nfiscal_config = {1: (3, 5), 2: (6, 8), 3: (9, 11), 4: (12, 2)}\nresult = deliveries_by_quarter(source1, fiscal_config)\n```"
}
```
### Example 3: Join Project and Employee Tables
```json
{
  "system": "Role: Python Code Generator. Generate clean, function-based Python code based on user requests.",
  "query": "Find all projects and their associated manager names by joining project and employee data.",
  "response": "```python\ndef get_project_manager_info(projects_df, employees_df):\n import pandas as pd\n import numpy as np\n\n merged = projects_df.merge(employees_df, left_on='manager_id', right_on='employee_id', how='left')\n merged = remove_column_collisions(merged, 'left')\n\n result = merged[['project_name', 'manager_name']]\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n\n return result.to_dict('records')\n\nresult = get_project_manager_info(source1, source2)\n```"
}
```
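
All three samples end by replacing `NaT`/`NaN` with `None` and returning `.to_dict('records')`. The samples do this with `result.replace({pd.NaT: None, np.nan: None}, inplace=True)`. The sketch below shows the same final step with a different idiom, `astype(object)` followed by `where`, offered as one alternative rather than as what the model generates. The function name `to_clean_records` and the data are hypothetical.

```python
import pandas as pd


def to_clean_records(df):
    """Replace NaN/NaT with None and return a list of row dictionaries."""
    # Cast to object first so None can be stored as an ordinary Python object
    # instead of being converted back into a missing-value marker.
    cleaned = df.astype(object).where(df.notna(), None)
    return cleaned.to_dict("records")


# Hypothetical data for illustration only.
frame = pd.DataFrame({
    "project_name": ["Atlas", "Borealis"],
    "budget": [1200.0, float("nan")],
    "due_date": pd.to_datetime(["2024-03-01", None]),
})

records = to_clean_records(frame)
print(records[1])
```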
---
## 📦 Model Details
- **Base Model**: Qwen2.5-Coder-0.5B-Instruct

- **Fine-Tuned By**: S Prem Kaushik

- **Specialization**: Python data manipulation for ETL, reporting, and time-based aggregation

---

## 🛡️ Limitations
- Designed for single-step transformations; complex pipelines should be modularized.

- Assumes `remove_column_collisions()` is defined in the execution environment; the generated code calls it but does not define it.

- Role-tagged prompt formatting (`<|system|>`, `<|user|>`, `<|assistant|>`, as in the Usage example) is recommended for best results.
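
Since the helper is not shipped with the model, generated code will fail unless you supply one. This card does not document its implementation, so the following is only a guess at a compatible helper: it matches the call shape used in the samples, `remove_column_collisions(df, 'left')`, and assumes the collisions carry the pandas default `_x`/`_y` merge suffixes (or an empty left suffix with `_y`, as in Example 1). The behaviour is an assumption, not the author's code.

```python
import pandas as pd


def remove_column_collisions(df, keep="left"):
    """Hypothetical helper: keep one side's copy of columns duplicated by a merge."""
    if keep not in ("left", "right"):
        raise ValueError("keep must be 'left' or 'right'")
    keep_suffix, drop_suffix = ("_x", "_y") if keep == "left" else ("_y", "_x")

    columns = set(df.columns)
    to_drop = []
    to_rename = {}
    for column in df.columns:
        if not isinstance(column, str):
            continue  # only string labels can carry merge suffixes
        if column.endswith(drop_suffix):
            base = column[: -len(drop_suffix)]
            # Drop only when the kept twin (suffixed or bare) is present.
            if base + keep_suffix in columns or base in columns:
                to_drop.append(column)
        elif column.endswith(keep_suffix):
            base = column[: -len(keep_suffix)]
            # Restore the original name when its twin is being dropped.
            if base + drop_suffix in columns:
                to_rename[column] = base
    return df.drop(columns=to_drop).rename(columns=to_rename)


# Hypothetical tables that both contain a "name" column.
projects = pd.DataFrame({"manager_id": [1, 2], "name": ["Atlas", "Borealis"]})
employees = pd.DataFrame({"employee_id": [1, 2], "name": ["Ana", "Ben"]})

merged = projects.merge(
    employees, left_on="manager_id", right_on="employee_id", how="left"
)
cleaned = remove_column_collisions(merged, "left")
print(list(cleaned.columns))  # ['manager_id', 'name', 'employee_id']
```

This sketch does not handle custom suffixes other than the two cases named above, so adapt it if your merges use different ones.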

## 📬 Contact & Dataset Access
If you are interested in accessing the fine-tuning dataset, reviewing the training code, or exploring potential collaborative opportunities, you are welcome to reach out.

Please contact me via my Hugging Face profile:

🔗 https://huggingface.co/10Prem09

Additional contact links (e.g., GitHub or LinkedIn) are available on my profile page.