🧠 Fine-Tuned Qwen 2.5 Coder: Python Data Engineering Assistant

📌 Model Overview

This model is a fine-tuned version of Qwen2.5-Coder-0.5B-Instruct, adapted specifically to write clean, structured Python code for data engineering and data transformation tasks. It is especially effective for single-step operations such as joining datasets, handling quarters, replacing null values, and returning structured output.

Fine-tuned by S Prem Kaushik, this model is optimized for precision, clean code generation, and adherence to Pythonic best practices.

🎯 Objective

This model consistently follows best practices in data transformation, including:

✅ Column Collision Handling: Automatically applies remove_column_collisions() after joins.
📅 Quarter & Date Handling: Uses fiscal quarter mapping from a configurable dictionary.
🧼 NaT/NaN Replacement: Replaces NaT and Nan with Python None.
📦 Function-Scoped Imports: All import statements are inside functions.
📤 Input/Output Structure: Returns results as structured dictionaries using .to_dict('records').

🧪 Training Data

Format: JSONL with system, query, and response entries
Domain: Realistic one-step data manipulation tasks in pandas
Coverage: Merging, joining, null replacement, quarter logic, grouping, etc.
Author: S Prem Kaushik

🛠 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct")
model.eval()

# Define prompt using raw role tags (no chat template)
system_prompt = "Role: Python Code Generator."
user_prompt = "Merge employee and department data, then filter by department to calculate the average salary for each employee."
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_prompt}\n<|assistant|>\n"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300, # Ideally can be varied between 300 and 512. 
        repetition_penalty=1.3, # Can be changed from 1.2 to 1.4 to get ideal results. 
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print result
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)

📊 Sample Dataset Entries

Example 1: Campaign ROI by Quarter with Column Collision Handling


{
  "system": "Role: Python Code Generator...",
  "query": "Analyze marketing campaign ROI by fiscal quarters, handling column collisions with a left join.",
  "response": "```python\ndef campaign_roi_by_quarter(campaign_df, spending_df, fiscal_config):\n import pandas as pd\n import numpy as np\n campaign_df['launch_date'] = pd.to_datetime(campaign_df['launch_date'])\n campaign_df['year'] = campaign_df['launch_date'].dt.year\n campaign_df['month'] = campaign_df['launch_date'].dt.month\n\n def assign_quarter(row):\n  for q, (start, end) in fiscal_config.items():\n   m = row['month']\n   if start > end:\n    if m >= start or m <= end:\n     return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n   elif start <= m <= end:\n    return f\"Q{q} {row['year']}\"\n\n campaign_df['quarter'] = campaign_df.apply(assign_quarter, axis=1)\n merged_df = pd.merge(campaign_df, spending_df, on='campaign_id', how='left', suffixes=('', '_y'))\n merged_df = remove_column_collisions(merged_df, 'left')\n merged_df['roi'] = merged_df['revenue_generated'] / merged_df['spending']\n result = merged_df.groupby('quarter')['roi'].mean().reset_index()\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n return result.to_dict('records')\n\nfiscal_config = {1: (7,9), 2: (10,12), 3: (1,3), 4: (4,6)}\nresult = campaign_roi_by_quarter(source1, source2, fiscal_config)\n```"
}

Example 2: Supplier Deliveries by Fiscal Quarter

{
  "system": "Role: Python Code Generator...",
  "query": "Track supplier deliveries by fiscal quarters: Q1 (March-May), Q2 (June-August), Q3 (September-November), Q4 (December-February).",
  "response": "```python\ndef deliveries_by_quarter(deliveries_df, fiscal_config):\n import pandas as pd\n import numpy as np\n deliveries_df['delivery_date'] = pd.to_datetime(deliveries_df['delivery_date'])\n deliveries_df['year'] = deliveries_df['delivery_date'].dt.year\n deliveries_df['month'] = deliveries_df['delivery_date'].dt.month\n\n def assign_quarter(row):\n  for q, (start, end) in fiscal_config.items():\n   m = row['month']\n   if start > end:\n    if m >= start or m <= end:\n     return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n   elif start <= m <= end:\n    return f\"Q{q} {row['year']}\"\n\n deliveries_df['quarter'] = deliveries_df.apply(assign_quarter, axis=1)\n result = deliveries_df.groupby('quarter').size().reset_index(name='delivery_count')\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n return result.to_dict('records')\n\nfiscal_config = {1: (3, 5), 2: (6, 8), 3: (9, 11), 4: (12, 2)}\nresult = deliveries_by_quarter(source1, fiscal_config)\n```"
}

Example 3: Join Project and Employee Tables

{
  "system": "Role: Python Code Generator. Generate clean, function-based Python code based on user requests.",
  "query": "Find all projects and their associated manager names by joining project and employee data.",
  "response": "```python\ndef get_project_manager_info(projects_df, employees_df):\n import pandas as pd\n import numpy as np\n\n merged = projects_df.merge(employees_df, left_on='manager_id', right_on='employee_id', how='left')\n merged = remove_column_collisions(merged, 'left')\n\n result = merged[['project_name', 'manager_name']]\n result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n\n return result.to_dict('records')\n\nresult = get_project_manager_info(source1, source2)\n```"
}

📦 Model Details

Base Model: Qwen2.5-Coder-0.5B-Instruct
Fine-Tuned By: S Prem Kaushik
Specialization: Python data manipulation for ETL, reporting, and time-based aggregation

🛡️ Limitations

Designed for single-step transformations; complex pipelines should be modularized.
Assumes remove_column_collisions() is available in the environment.
Chat-style prompt formatting is recommended for best results.

📬 Contact & Dataset Access

If you are interested in accessing the fine-tuning dataset, reviewing the training code, or exploring potential collaborative opportunities, you are welcome to reach out.

Please contact me via my Hugging Face profile:

🔗 https://huggingface.co/10Prem09

Additional contact links (e.g., GitHub or LinkedIn) are available on my profile page.

10Prem09
/

finetuned_Qwen2.5_Coder_0.5B_Instruct