Fine-Tuned Qwen 2.5 Coder: Python Data Engineering Assistant
Model Overview
This model is a fine-tuned version of Qwen2.5-Coder-0.5B-Instruct, adapted to write clean, structured Python code for data engineering and data transformation tasks. It is most effective on single-step operations such as joining datasets, mapping dates to fiscal quarters, replacing null values, and returning structured output.
Fine-tuned by S Prem Kaushik, the model targets precise, clean code generation and adherence to Pythonic best practices.
Objective
The model is trained to follow these data transformation conventions:
- Column Collision Handling: Automatically applies `remove_column_collisions()` after joins.
- Quarter & Date Handling: Uses fiscal quarter mapping from a configurable dictionary.
- NaT/NaN Replacement: Replaces `NaT` and `NaN` with Python `None`.
- Function-Scoped Imports: All `import` statements are inside functions.
- Input/Output Structure: Returns results as a list of record dictionaries using `.to_dict('records')`.
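The conventions above can be sketched in a single function. This is a minimal illustration rather than actual model output: the function name `summarize_orders`, the column names, and the sample data are hypothetical. NaN and NaT are replaced here with `astype(object).where(...)`, a commonly used alternative to the `replace` call seen in the training samples.

```python
def summarize_orders(orders_df):
    # Function-scoped import, matching the convention described above.
    import pandas as pd

    result = orders_df.copy()
    result["shipped_at"] = pd.to_datetime(result["shipped_at"])

    # Cast to object so None can be stored, then swap every NaN/NaT for None.
    result = result.astype(object).where(result.notna(), None)

    # Return a list of plain dictionaries, one per row.
    return result.to_dict("records")


import pandas as pd

# Hypothetical input with one missing amount and one missing date.
orders = pd.DataFrame(
    {
        "order_id": [1, 2],
        "amount": [10.5, None],
        "shipped_at": ["2024-01-15", None],
    }
)
records = summarize_orders(orders)
print(records)
```

The second record carries `None` for both the missing amount and the missing date, which keeps the output serializable as JSON.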
Training Data
- Format: JSONL with `system`, `query`, and `response` entries
- Domain: Realistic one-step data manipulation tasks in pandas
- Coverage: Merging, joining, null replacement, quarter logic, grouping, etc.
- Author: S Prem Kaushik
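Records in this format can be read with the standard `json` module. The sketch below assumes one JSON object per line with the three keys listed above; the helper name `parse_jsonl` and the sample line are made up for illustration and are not taken from the dataset.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    required = {"system", "query", "response"}
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records


# Illustrative line only, not an actual dataset entry.
sample = (
    '{"system": "Role: Python Code Generator.", '
    '"query": "Join two tables.", '
    '"response": "def f():\\n    pass"}\n'
)
records = parse_jsonl(sample)
print(records[0]["query"])
```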
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the model
model = AutoModelForCausalLM.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct")
model.eval()
# Define prompt using raw role tags (no chat template)
system_prompt = "Role: Python Code Generator."
user_prompt = "Merge employee and department data, then filter by department to calculate the average salary for each employee."
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_prompt}\n<|assistant|>\n"
# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,  # Can be varied between 300 and 512.
        repetition_penalty=1.3,  # Values from 1.2 to 1.4 can be tried to tune results.
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode and print result
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
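Because `outputs[0]` holds the prompt tokens followed by the completion, the decoded text includes the prompt. The training responses wrap code in a `python` code fence, so one way to post-process the output is to pull out the first fenced block. The helper below is a hypothetical sketch (`extract_python_block` is not part of the model or of `transformers`) and falls back to the raw text when no fence is found.

```python
import re

# The fence marker is built programmatically so this example nests cleanly in markdown.
FENCE = "`" * 3
PATTERN = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, flags=re.DOTALL)


def extract_python_block(generated_text):
    """Return the body of the first python-fenced block, or the stripped text."""
    match = PATTERN.search(generated_text)
    if match:
        return match.group(1).strip()
    return generated_text.strip()


# Made-up text standing in for decoded model output.
demo = f"prompt text\n{FENCE}python\ndef f():\n    return 1\n{FENCE}\ntrailing"
print(extract_python_block(demo))
```

To decode only the completion, a common alternative is to slice `outputs[0]` from `inputs["input_ids"].shape[1]` onward before calling `tokenizer.decode`.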
Sample Dataset Entries
Example 1: Campaign ROI by Quarter with Column Collision Handling
{
"system": "Role: Python Code Generator...",
"query": "Analyze marketing campaign ROI by fiscal quarters, handling column collisions with a left join.",
"response": "```python\ndef campaign_roi_by_quarter(campaign_df, spending_df, fiscal_config):\n import pandas as pd\n import numpy as np\n campaign_df['launch_date'] = pd.to_datetime(campaign_df['launch_date'])\n campaign_df['year'] = campaign_df['launch_date'].dt.year\n campaign_df['month'] = campaign_df['launch_date'].dt.month\n\n def assign_quarter(row):\n for q, (start, end) in fiscal_config.items():\n m = row['month']\n if start > end:\n if m >= start or m <= end:\n return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n elif start <= m <= end:\n return f\"Q{q} {row['year']}\"\n\n campaign_df['quarter'] = campaign_df.apply(assign_quarter, axis=1)\n merged_df = pd.merge(campaign_df, spending_df, on='campaign_id', how='left', suffixes=('', '_y'))\n merged_df = remove_column_collisions(merged_df, 'left')\n merged_df['roi'] = merged_df['revenue_generated'] / merged_df['spending']\n result = merged_df.groupby('quarter')['roi'].mean().reset_index()\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n return result.to_dict('records')\n\nfiscal_config = {1: (7,9), 2: (10,12), 3: (1,3), 4: (4,6)}\nresult = campaign_roi_by_quarter(source1, source2, fiscal_config)\n```"
}
Example 2: Supplier Deliveries by Fiscal Quarter
{
"system": "Role: Python Code Generator...",
"query": "Track supplier deliveries by fiscal quarters: Q1 (March-May), Q2 (June-August), Q3 (September-November), Q4 (December-February).",
"response": "```python\ndef deliveries_by_quarter(deliveries_df, fiscal_config):\n import pandas as pd\n import numpy as np\n deliveries_df['delivery_date'] = pd.to_datetime(deliveries_df['delivery_date'])\n deliveries_df['year'] = deliveries_df['delivery_date'].dt.year\n deliveries_df['month'] = deliveries_df['delivery_date'].dt.month\n\n def assign_quarter(row):\n for q, (start, end) in fiscal_config.items():\n m = row['month']\n if start > end:\n if m >= start or m <= end:\n return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n elif start <= m <= end:\n return f\"Q{q} {row['year']}\"\n\n deliveries_df['quarter'] = deliveries_df.apply(assign_quarter, axis=1)\n result = deliveries_df.groupby('quarter').size().reset_index(name='delivery_count')\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n return result.to_dict('records')\n\nfiscal_config = {1: (3, 5), 2: (6, 8), 3: (9, 11), 4: (12, 2)}\nresult = deliveries_by_quarter(source1, fiscal_config)\n```"
}
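The quarter assignment used in both examples can be isolated as a small pure-Python function. This sketch mirrors the logic of `assign_quarter` from the samples above; the name `fiscal_quarter_label` and the explicit error for unmapped months are additions for illustration. Following the samples, a quarter that wraps past December, such as `(12, 2)`, is labeled with the calendar year in which it starts.

```python
def fiscal_quarter_label(month, year, fiscal_config):
    """Map a calendar month and year to a label such as 'Q4 2023'.

    fiscal_config maps quarter number -> (start_month, end_month), inclusive.
    """
    for quarter, (start, end) in fiscal_config.items():
        if start > end:
            # Quarter wraps around the year boundary, e.g. December-February.
            if month >= start:
                return f"Q{quarter} {year}"
            if month <= end:
                return f"Q{quarter} {year - 1}"
        elif start <= month <= end:
            return f"Q{quarter} {year}"
    raise ValueError(f"month {month} is not covered by fiscal_config")


# Same mapping as Example 2: Q1 March-May through Q4 December-February.
fiscal_config = {1: (3, 5), 2: (6, 8), 3: (9, 11), 4: (12, 2)}
print(fiscal_quarter_label(12, 2023, fiscal_config))  # Q4 2023
print(fiscal_quarter_label(1, 2024, fiscal_config))   # Q4 2023
print(fiscal_quarter_label(4, 2024, fiscal_config))   # Q1 2024
```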
Example 3: Join Project and Employee Tables
{
"system": "Role: Python Code Generator. Generate clean, function-based Python code based on user requests.",
"query": "Find all projects and their associated manager names by joining project and employee data.",
"response": "```python\ndef get_project_manager_info(projects_df, employees_df):\n import pandas as pd\n import numpy as np\n\n merged = projects_df.merge(employees_df, left_on='manager_id', right_on='employee_id', how='left')\n merged = remove_column_collisions(merged, 'left')\n\n result = merged[['project_name', 'manager_name']]\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n\n return result.to_dict('records')\n\nresult = get_project_manager_info(source1, source2)\n```"
}
Model Details
Base Model: Qwen2.5-Coder-0.5B-Instruct
Fine-Tuned By: S Prem Kaushik
Specialization: Python data manipulation for ETL, reporting, and time-based aggregation
Limitations
Designed for single-step transformations; complex pipelines should be modularized.
Assumes remove_column_collisions() is available in the environment.
Chat-style prompt formatting (system, user, and assistant role tags, as in the Usage section) is recommended for best results.
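Because generated code calls `remove_column_collisions()` without defining it, the helper has to exist in the calling environment. Its actual implementation is not shown in this card, so the version below is purely an assumed stand-in. It guesses that the second argument names the side whose columns win, drops the other side's suffixed duplicates, and strips the winning side's suffix. The `_x`/`_y` defaults are the pandas merge suffixes; the `suffixes` parameter itself is an addition for illustration.

```python
import pandas as pd


def remove_column_collisions(df, keep="left", suffixes=("_x", "_y")):
    """Assumed stand-in: resolve merge suffix collisions in favor of one side."""
    left_suffix, right_suffix = suffixes
    if keep == "left":
        keep_suffix, drop_suffix = left_suffix, right_suffix
    else:
        keep_suffix, drop_suffix = right_suffix, left_suffix

    # Drop the duplicated columns that came from the losing side.
    drop_cols = [c for c in df.columns if drop_suffix and c.endswith(drop_suffix)]
    result = df.drop(columns=drop_cols)

    # Strip the suffix from the winning side's columns to restore original names.
    if keep_suffix:
        result = result.rename(
            columns={
                c: c[: -len(keep_suffix)]
                for c in result.columns
                if c.endswith(keep_suffix)
            }
        )
    return result


# Hypothetical tables that both contain a "status" column.
projects = pd.DataFrame(
    {"project_name": ["Apollo"], "manager_id": [7], "status": ["active"]}
)
employees = pd.DataFrame(
    {"employee_id": [7], "manager_name": ["Dana"], "status": ["on leave"]}
)
merged = projects.merge(
    employees, left_on="manager_id", right_on="employee_id", how="left"
)
cleaned = remove_column_collisions(merged, "left")
print(list(cleaned.columns))
```

One caveat of this suffix-based approach is that a column whose real name happens to end in `_x` or `_y` would be renamed or dropped as well, so a production helper would likely need to track the colliding names explicitly.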
Contact & Dataset Access
If you are interested in accessing the fine-tuning dataset, reviewing the training code, or exploring potential collaborative opportunities, you are welcome to reach out.
Please contact me via my Hugging Face profile:
https://huggingface.co/10Prem09
Additional contact links (e.g., GitHub or LinkedIn) are available on my profile page.