---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
pipeline_tag: text2text-generation
tags:
- code
---

# 🧠 Fine-Tuned Qwen 2.5 Coder: Python Data Engineering Assistant

## 📌 Model Overview

This model is a fine-tuned version of **Qwen2.5-Coder-0.5B-Instruct**, adapted specifically to write clean, structured Python code for **data engineering and data transformation tasks**. It is especially effective for single-step operations such as joining datasets, handling quarters, replacing null values, and returning structured output.

Fine-tuned by **S Prem Kaushik**, this model is optimized for precision, clean code generation, and adherence to Pythonic best practices.

---

## 🎯 Objective

The model is trained to follow these data transformation conventions:

- ✅ **Column Collision Handling**: Applies `remove_column_collisions()` after joins.
- 📅 **Quarter & Date Handling**: Uses fiscal quarter mapping from a configurable dictionary.
- 🧼 **NaT/NaN Replacement**: Replaces `NaT` and `NaN` with Python `None`.
- 📦 **Function-Scoped Imports**: All `import` statements are placed inside the generated functions.
- 📤 **Input/Output Structure**: Returns results as a list of record dictionaries using `.to_dict('records')`.

---

## 🧪 Training Data

- **Format**: JSONL with `system`, `query`, and `response` entries
- **Domain**: Realistic one-step data manipulation tasks in pandas
- **Coverage**: Merging, joining, null replacement, quarter logic, grouping, etc.
- **Author**: S Prem Kaushik

---

## 🛠 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model (this snippet requires a CUDA-capable GPU)
model = AutoModelForCausalLM.from_pretrained(
    "10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct",
    torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("10Prem09/finetuned_Qwen2.5_Coder_0.5B_Instruct")
model.eval()

# Define prompt using raw role tags (no chat template)
system_prompt = "Role: Python Code Generator."
user_prompt = "Merge employee and department data, then filter by department to calculate the average salary for each employee."
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_prompt}\n<|assistant|>\n"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,        # Ideally can be varied between 300 and 512.
        repetition_penalty=1.3,    # Can be changed from 1.2 to 1.4 to get ideal results.
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print result (the decoded text includes the prompt)
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
```

---

## 📊 Sample Dataset Entries

### Example 1: Campaign ROI by Quarter with Column Collision Handling

```json
{
  "system": "Role: Python Code Generator...",
  "query": "Analyze marketing campaign ROI by fiscal quarters, handling column collisions with a left join.",
  "response": "```python\ndef campaign_roi_by_quarter(campaign_df, spending_df, fiscal_config):\n    import pandas as pd\n    import numpy as np\n    campaign_df['launch_date'] = pd.to_datetime(campaign_df['launch_date'])\n    campaign_df['year'] = campaign_df['launch_date'].dt.year\n    campaign_df['month'] = campaign_df['launch_date'].dt.month\n\n    def assign_quarter(row):\n        for q, (start, end) in fiscal_config.items():\n            m = row['month']\n            if start > end:\n                if m >= start or m <= end:\n                    return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n            elif start <= m <= end:\n                return f\"Q{q} {row['year']}\"\n\n    campaign_df['quarter'] = campaign_df.apply(assign_quarter, axis=1)\n    merged_df = pd.merge(campaign_df, spending_df, on='campaign_id', how='left', suffixes=('', '_y'))\n    merged_df = remove_column_collisions(merged_df, 'left')\n    merged_df['roi'] = merged_df['revenue_generated'] / merged_df['spending']\n    result = merged_df.groupby('quarter')['roi'].mean().reset_index()\n    result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n    return result.to_dict('records')\n\nfiscal_config = {1: (7,9), 2: (10,12), 3: (1,3), 4: (4,6)}\nresult = campaign_roi_by_quarter(source1, source2, fiscal_config)\n```"
}
```

### Example 2: Supplier Deliveries by Fiscal Quarter

```json
{
  "system": "Role: Python Code Generator...",
  "query": "Track supplier deliveries by fiscal quarters: Q1 (March-May), Q2 (June-August), Q3 (September-November), Q4 (December-February).",
  "response": "```python\ndef deliveries_by_quarter(deliveries_df, fiscal_config):\n    import pandas as pd\n    import numpy as np\n    deliveries_df['delivery_date'] = pd.to_datetime(deliveries_df['delivery_date'])\n    deliveries_df['year'] = deliveries_df['delivery_date'].dt.year\n    deliveries_df['month'] = deliveries_df['delivery_date'].dt.month\n\n    def assign_quarter(row):\n        for q, (start, end) in fiscal_config.items():\n            m = row['month']\n            if start > end:\n                if m >= start or m <= end:\n                    return f\"Q{q} {row['year'] if m >= start else row['year']-1}\"\n            elif start <= m <= end:\n                return f\"Q{q} {row['year']}\"\n\n    deliveries_df['quarter'] = deliveries_df.apply(assign_quarter, axis=1)\n    result = deliveries_df.groupby('quarter').size().reset_index(name='delivery_count')\n    result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n    return result.to_dict('records')\n\nfiscal_config = {1: (3, 5), 2: (6, 8), 3: (9, 11), 4: (12, 2)}\nresult = deliveries_by_quarter(source1, fiscal_config)\n```"
}
```

### Example 3: Join Project and Employee Tables

```json
{
  "system": "Role: Python Code Generator. Generate clean, function-based Python code based on user requests.",
  "query": "Find all projects and their associated manager names by joining project and employee data.",
  "response": "```python\ndef get_project_manager_info(projects_df, employees_df):\n    import pandas as pd\n    import numpy as np\n\n    merged = projects_df.merge(employees_df, left_on='manager_id', right_on='employee_id', how='left')\n    merged = remove_column_collisions(merged, 'left')\n\n    result = merged[['project_name', 'manager_name']]\n    result.replace({pd.NaT: None, np.nan: None}, inplace=True)\n\n    return result.to_dict('records')\n\nresult = get_project_manager_info(source1, source2)\n```"
}
```

---

## 📦 Model Details

- **Base Model**: Qwen2.5-Coder-0.5B-Instruct
- **Fine-Tuned By**: S Prem Kaushik
- **Specialization**: Python data manipulation for ETL, reporting, and time-based aggregation

---

## 🛡️ Limitations

- Designed for single-step transformations; complex pipelines should be modularized.
- Assumes `remove_column_collisions()` is available in the environment.
- Chat-style prompt formatting (the `<|system|>`, `<|user|>`, and `<|assistant|>` role tags shown in Usage) is recommended for best results.

## 📬 Contact & Dataset Access

If you are interested in accessing the fine-tuning dataset, reviewing the training code, or exploring potential collaborative opportunities, you are welcome to reach out.

Please contact me via my Hugging Face profile:

🔗 https://huggingface.co/10Prem09

Additional contact links (e.g., GitHub or LinkedIn) are available on my profile page.
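
## 🧩 Sketch: a stand-in for `remove_column_collisions()`

The Limitations section notes that generated code assumes `remove_column_collisions()` already exists in the environment, but this card does not define it. The block below is a minimal, hypothetical stand-in so the sample responses can be run locally. The call shape (a DataFrame plus `'left'`) is taken from the examples above; everything else is an assumption, in particular that collisions show up with pandas' default `_x`/`_y` suffixes or with `('', '_y')` as in Example 1, and that the second argument names the side whose values should be kept. The author's actual helper may behave differently.

```python
import pandas as pd


def remove_column_collisions(df, keep="left"):
    """Hypothetical helper: collapse suffixed duplicate columns left by a merge.

    Assumption (not confirmed by the model card): collisions carry pandas'
    default '_x'/'_y' suffixes, or ('', '_y') as in Example 1.
    """
    if keep not in ("left", "right"):
        raise ValueError("keep must be 'left' or 'right'")

    cols = set(df.columns)
    to_drop, to_rename = [], {}
    for col in df.columns:
        if not (isinstance(col, str) and col.endswith("_y")):
            continue
        base = col[:-2]
        # Find the left-hand twin; without one, '_y' is just part of the name.
        if base + "_x" in cols:
            left_col = base + "_x"
        elif base in cols:
            left_col = base
        else:
            continue
        if keep == "left":
            to_drop.append(col)
            if left_col != base:
                to_rename[left_col] = base
        else:
            to_drop.append(left_col)
            to_rename[col] = base
    # drop/rename return a new frame; the input is left unmodified.
    return df.drop(columns=to_drop).rename(columns=to_rename)


# Usage with made-up frames (names and values are illustrative only)
projects = pd.DataFrame({"id": [1, 2], "name": ["alpha", "beta"]})
owners = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
merged = projects.merge(owners, on="id", how="left")  # columns: id, name_x, name_y
cleaned = remove_column_collisions(merged, "left")    # columns: id, name
print(cleaned.to_dict("records"))
```

This sketch only treats a `_y` column as a collision when a matching left-hand column is present, so a column whose real name happens to end in `_y` is left alone.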