🚀 MiniCoderX: A Lightweight Transformer for Code Generation

MiniCoderX is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like LangChain and Ollama, making it ideal for rapid local experimentation.


✨ Features

  • 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
  • 🌲 AST/CFG-aware encoding for code structure understanding
  • 💾 Syntax-constrained decoding using grammar rules and trees
  • 🔁 Multi-task heads: generation, summarization, translation, bug fixing
  • ⚙️ LangChain + Ollama integration for fast local deployment
  • 🧪 Evaluated on HumanEval, CodeXGLUE, MBPP

๐Ÿ—๏ธ Model Architecture

Component Description
Base Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5)
Structure-aware AST and Control Flow Graph embeddings + positional masks
Heads Multi-task heads for flexible downstream use
Decoder Syntax-aware beam search (grammar constraints)
Tokenizer BPE or SentencePiece trained on code + comments
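
For the tokenizer row, a minimal sketch of training a BPE code tokenizer with SentencePiece; the corpus path, vocabulary size, and model prefix below are illustrative assumptions, not the project's actual settings:

```python
import sentencepiece as spm

# Train a BPE tokenizer on code + comments.
# "corpus.txt" and vocab_size are placeholder assumptions.
spm.SentencePieceTrainer.train(
    input="corpus.txt",             # one snippet or comment per line
    model_prefix="minicoderx_bpe",  # writes minicoderx_bpe.model / .vocab
    model_type="bpe",
    vocab_size=16000,
    character_coverage=1.0,         # code uses many rare symbols; keep them all
)

sp = spm.SentencePieceProcessor(model_file="minicoderx_bpe.model")
print(sp.encode("def quicksort(xs): ...", out_type=str))
```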

🔧 Architectural Additions (SOTA Techniques)

🌲 AST/CFG Embeddings

Enhances understanding of code structure by:

  • Adding AST node/edge embeddings to token inputs
  • Including path embeddings between syntactic elements
  • Graph-aware position encoding

Inspired by: StructCoder, AST-T5, Code4Struct
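
To make the first bullet concrete, here is a minimal PyTorch sketch that sums AST node-type embeddings into the token embeddings; the class name, sizes, and tagging scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Token embeddings enriched with AST node-type embeddings (sketch)."""

    def __init__(self, vocab_size=16000, num_node_types=64, d_model=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.ast = nn.Embedding(num_node_types, d_model)  # e.g., FunctionDef, If, Call

    def forward(self, token_ids, node_type_ids):
        # node_type_ids[b, j] tags token j with the type of its enclosing AST node
        return self.tok(token_ids) + self.ast(node_type_ids)

emb = StructureAwareEmbedding()
x = emb(torch.randint(0, 16000, (2, 8)), torch.randint(0, 64, (2, 8)))
print(x.shape)  # torch.Size([2, 8, 256])
```

Path embeddings and graph-aware position encodings extend the same idea with pairwise structural features.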

💾 Syntax-Constrained Decoding

Improves generation accuracy and reduces invalid code by:

  • Restricting token outputs using grammar constraints (BNF/PEG)
  • Custom decoding logic (e.g., Tree traversal)
  • Dynamic decoding masks based on token state

Inspired by: TreeGen, Code4Struct
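
The dynamic-mask bullet can be sketched as a per-step logits filter: a grammar oracle (hypothetical here, e.g., an incremental BNF/PEG parser over the partial program) reports which token ids keep the output parseable, and everything else is masked out:

```python
import torch

def constrained_step(logits, allowed_token_ids):
    """Keep only grammar-legal tokens at this decoding step (sketch).

    `allowed_token_ids` would come from a grammar oracle over the
    partial output -- a hypothetical component, not a real API.
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# Toy vocabulary of 6 tokens where only ids 1 and 4 are legal next.
logits = torch.randn(6)
next_id = constrained_step(logits, torch.tensor([1, 4])).argmax().item()
assert next_id in (1, 4)
```

In syntax-aware beam search, the same mask is applied to every hypothesis before scoring, so no beam can leave the grammar.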

๐Ÿ” Multi-Task Learning Heads

Supports multiple tasks:

  • Code generation (NL โ†’ Code)
  • Summarization (Code โ†’ NL)
  • Translation (Java โ‡„ Python)
  • Code repair and completion

Inspired by: CodeT5+, CoTexT
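
In CodeT5-style models, these "heads" are often realized as task prefixes over one shared encoder-decoder rather than separate output layers; a sketch of that batch formatting, with prefix strings that are purely illustrative:

```python
# Illustrative task prefixes for a shared encoder-decoder (CodeT5-style).
TASK_PREFIXES = {
    "generate":  "generate python: ",
    "summarize": "summarize: ",
    "translate": "translate java to python: ",
    "repair":    "fix bug: ",
}

def format_example(task, source, target):
    """Prefix the source so one shared model can route all four tasks."""
    return {"input": TASK_PREFIXES[task] + source, "label": target}

batch = [
    format_example("generate", "sort a list with quicksort", "def quicksort(xs): ..."),
    format_example("summarize", "def add(a, b): return a + b", "Add two numbers."),
]
print(batch[0]["input"])  # generate python: sort a list with quicksort
```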


⚡ LangChain + Ollama Integration

💡 Why?

To enable:

  • 🧪 Local testing and chaining of models via LangChain
  • 🦮 Fast prototyping with Ollama for custom transformer backends
  • 🔄 Easy switching between small local models and larger remote APIs

🔌 Integration Plan

```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")

print(result)
```

✅ Ollama will be used to serve your fine-tuned SLM locally
✅ LangChain will wrap it with prompts, chains, and memory features for interactivity
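
For `Ollama(model="minicoderx")` to resolve, the fine-tuned weights must first be registered with Ollama. A minimal sketch, assuming the model has been exported to GGUF (the file name, parameter value, and system prompt are illustrative):

```
# Modelfile -- registers the exported weights under the name "minicoderx"
FROM ./minicoderx.gguf
PARAMETER temperature 0.2
SYSTEM You are a concise Python code generator.
```

Then build and try it locally:

```
ollama create minicoderx -f Modelfile
ollama run minicoderx "Write a function that reverses a string"
```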


📦 Datasets

| Dataset | Use |
|---|---|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |

🔬 Training Objectives

  • ✅ Span masking (CodeT5-style)
  • ✅ Contrastive pretraining
  • ✅ Instruction tuning (natural prompt formatting)
  • ✅ Autoregressive generation
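
A toy sketch of the span-masking objective: contiguous spans are cut from the input and replaced with sentinel tokens, and the target reconstructs each span after its sentinel (sentinel names follow the T5 convention; the single-span sampling here is a simplification):

```python
import random

def span_mask(tokens, span_len=2, seed=0):
    """Replace one contiguous span with a sentinel (simplified sketch).

    Returns (input, target) in the T5/CodeT5 span-denoising format;
    real training samples many spans with varying lengths.
    """
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len)
    inp = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    tgt = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return inp, tgt

inp, tgt = span_mask("def add ( a , b ) : return a + b".split())
print(" ".join(inp))  # input with one span replaced by <extra_id_0>
print(" ".join(tgt))  # <extra_id_0> followed by the masked span
```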

📊 Evaluation Benchmarks

| Benchmark | Metric |
|---|---|
| HumanEval | Pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, EM |
| Unit Tests | Pass rate |
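
Pass@1 in the table is usually computed with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k random draws is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15
```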

🧪 Project Roadmap

✅ Phase 1: MVP Model

  • Train TinyCodeT5 model with span masking
  • Evaluate on MBPP and HumanEval-lite
  • Serve via Ollama + LangChain prompt chain

๐Ÿ” Phase 2: Structural Learning

  • Add AST/CFG encodings
  • Introduce grammar-constrained decoding
  • Multi-task training (gen, sum, repair)

📦 Phase 3: Optimization & Packaging

  • Distill from larger model (e.g., StarCoder)
  • Add reinforcement fine-tuning via test cases
  • Export to Hugging Face + Ollama integration

๐Ÿ› ๏ธ Tools & Frameworks


๐Ÿค Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!


📜 License

MIT License. Built for research and open experimentation.


📧 Contact

Drop an issue or discussion on GitHub!
