# MiniCoderX: A Lightweight Transformer for Code Generation
MiniCoderX is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like LangChain and Ollama, making it ideal for rapid local experimentation.
## Features

- Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- AST/CFG-aware encoding for code structure understanding
- Syntax-constrained decoding using grammar rules and trees
- Multi-task heads: generation, summarization, translation, bug fixing
- LangChain + Ollama integration for fast local deployment
- Evaluated on HumanEval, CodeXGLUE, and MBPP
## Model Architecture

| Component | Description |
|---|---|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) |
| Structure-aware encoding | AST and control-flow-graph (CFG) embeddings + positional masks |
| Heads | Multi-task heads for flexible downstream use |
| Decoder | Syntax-aware beam search with grammar constraints |
| Tokenizer | BPE or SentencePiece trained on code + comments |
## Architectural Additions (SOTA Techniques)

### AST/CFG Embeddings

Enhances the model's understanding of code structure by:

- Adding AST node/edge embeddings to token inputs (sketched below)
- Including path embeddings between syntactic elements
- Using graph-aware position encodings

Inspired by: StructCoder, AST-T5, Code4Struct
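To make the first bullet concrete, here is a minimal PyTorch sketch of structure-aware input embeddings. It assumes a preprocessing pass has already tagged each token with the type ID of its nearest enclosing AST node; the module name and `num_node_types` are illustrative, not existing MiniCoderX code.

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Token embeddings enriched with AST node-type embeddings.

    Assumes preprocessing has tagged every token with the ID of its
    nearest enclosing AST node type (e.g. FunctionDef, If, Call).
    """

    def __init__(self, vocab_size: int, num_node_types: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.ast = nn.Embedding(num_node_types, d_model)

    def forward(self, token_ids: torch.Tensor, node_type_ids: torch.Tensor) -> torch.Tensor:
        # Both inputs are (batch, seq_len); summing the two embeddings lets
        # structural signals reach every transformer layer.
        return self.tok(token_ids) + self.ast(node_type_ids)
```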
### Syntax-Constrained Decoding

Improves generation accuracy and reduces syntactically invalid output by:

- Restricting token outputs with grammar constraints (BNF/PEG); see the sketch below
- Custom decoding logic (e.g., tree traversal)
- Dynamic decoding masks driven by the current parser state

Inspired by: TreeGen, Code4Struct
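As an illustration, the masking could be implemented as a Hugging Face `LogitsProcessor` that suppresses grammar-violating tokens at each step. Only the `LogitsProcessor` interface below is real Transformers API; the `grammar` object and its `allowed_next_tokens` method are hypothetical stand-ins for an incremental BNF/PEG parser.

```python
import torch
from transformers import LogitsProcessor

class GrammarConstrainedLogits(LogitsProcessor):
    """Masks tokens the grammar does not allow as the next symbol."""

    def __init__(self, grammar):
        self.grammar = grammar  # hypothetical incremental parser

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        for i, prefix in enumerate(input_ids.tolist()):
            # The parser reports which token IDs may legally follow this prefix.
            allowed = self.grammar.allowed_next_tokens(prefix)
            mask[i, list(allowed)] = 0.0
        return scores + mask  # illegal tokens get -inf, legal ones keep their score
```

At generation time the processor would be passed to `model.generate(...)` via `logits_processor=LogitsProcessorList([GrammarConstrainedLogits(grammar)])`.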
### Multi-Task Learning Heads

Supports multiple tasks from a single backbone:

- Code generation (NL → code)
- Summarization (code → NL)
- Translation (Java ↔ Python)
- Code repair and completion

Inspired by: CodeT5+, CoTexT
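A lightweight way to get multi-task behavior from one encoder-decoder is CodeT5-style task prefixes, casting every task as text-to-text. The prefix strings below are illustrative design choices, not values fixed by CodeT5+ or the current codebase:

```python
# Illustrative task tags; the exact strings are a design choice.
TASK_PREFIXES = {
    "generate":  "generate python: ",
    "summarize": "summarize: ",
    "translate": "translate java to python: ",
    "repair":    "fix bug: ",
}

def format_example(task: str, source: str) -> str:
    """Prefixes the source so one seq2seq model can serve every task."""
    return TASK_PREFIXES[task] + source

print(format_example("summarize", "def add(a, b): return a + b"))
# -> "summarize: def add(a, b): return a + b"
```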
## LangChain + Ollama Integration

### Why?

To enable:

- Local testing and chaining of models via LangChain
- Fast prototyping with Ollama for custom transformer backends
- Easy switching between small local models and larger remote APIs
### Integration Plan

```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```
- Ollama serves the fine-tuned SLM locally
- LangChain wraps it with prompts, chains, and memory features for interactivity
## Datasets

| Dataset | Use |
|---|---|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |
## Training Objectives

- Span masking (CodeT5-style; sketched below)
- Contrastive pretraining
- Instruction tuning (natural prompt formatting)
- Autoregressive generation
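For reference, a simplified sketch of span masking in the T5/CodeT5 style; the sentinel format and hyperparameters are illustrative:

```python
import random

def span_mask(tokens, mask_ratio=0.15, mean_span=3):
    """Simplified T5/CodeT5-style span corruption.

    Contiguous spans are replaced by sentinel tokens in the encoder input;
    the decoder target reproduces each sentinel followed by the hidden span.
    """
    inp, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        # Start a span with probability mask_ratio / mean_span, so roughly
        # mask_ratio of all tokens end up masked in expectation.
        if random.random() < mask_ratio / mean_span:
            span = tokens[i : i + mean_span]
            sentinel = f"<extra_id_{sid}>"
            inp.append(sentinel)
            tgt.append(sentinel)
            tgt.extend(span)
            sid += 1
            i += len(span)
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

# Example:
# inp, tgt = span_mask("def add ( a , b ) : return a + b".split())
```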
## Evaluation Benchmarks

| Benchmark | Metric |
|---|---|
| HumanEval | pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, exact match (EM) |
| Unit tests | Pass rate |
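For HumanEval-style pass@k, the standard unbiased estimator from the Codex paper can be used (n samples per problem, c of which pass all unit tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex/HumanEval paper."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25
```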
## Project Roadmap

### Phase 1: MVP Model

- Train a TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via an Ollama + LangChain prompt chain

### Phase 2: Structural Learning

- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### Phase 3: Optimization & Packaging

- Distill from a larger model (e.g., StarCoder); see the loss sketch below
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face + Ollama integration
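For the distillation step, a common recipe blends a temperature-softened KL term against the teacher's logits with the usual cross-entropy on ground-truth tokens. A minimal sketch, assuming the student and teacher (e.g., StarCoder) share a vocabulary; `T` and `alpha` are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL against the teacher, blended with hard-label CE.

    T softens both distributions; alpha trades imitation vs. ground truth.
    """
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to be independent of T
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + (1 - alpha) * ce
```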
## Tools & Frameworks

- Hugging Face Transformers
- LangChain
- Ollama
- SentencePiece / BPE
- NetworkX for AST/CFG graph representation
## Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!

## License

MIT License. Built for research and open experimentation.

## Contact

Open an issue or start a discussion on GitHub!