# MiniCoderX: A Lightweight Transformer for Code Generation
MiniCoderX is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like LangChain and Ollama, making it ideal for rapid local experimentation.
## Features

- Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- AST/CFG-aware encoding for code structure understanding
- Syntax-constrained decoding using grammar rules and trees
- Multi-task heads: generation, summarization, translation, bug fixing
- LangChain + Ollama integration for fast local deployment
- Evaluated on HumanEval, CodeXGLUE, and MBPP
## Model Architecture

| Component | Description |
|---|---|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) |
| Structure-aware encoding | AST and control-flow-graph (CFG) embeddings + positional masks |
| Heads | Multi-task heads for flexible downstream use |
| Decoder | Syntax-aware beam search with grammar constraints |
| Tokenizer | BPE or SentencePiece trained on code + comments |
## Architectural Additions (SOTA Techniques)

### AST/CFG Embeddings

Enhances the model's understanding of code structure by:

- Adding AST node/edge embeddings to token inputs (sketched below)
- Including path embeddings between syntactic elements
- Using graph-aware position encodings

Inspired by: StructCoder, AST-T5, Code4Struct
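To make the first bullet concrete, here is a minimal PyTorch sketch of structure-aware input embeddings. It assumes a preprocessing pass has already tagged each token with the type ID of its nearest enclosing AST node; the module name and `num_node_types` are illustrative, not existing MiniCoderX code.

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Token embeddings enriched with AST node-type embeddings.

    Assumes preprocessing has tagged every token with the ID of its
    nearest enclosing AST node type (e.g. FunctionDef, If, Call).
    """

    def __init__(self, vocab_size: int, num_node_types: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.ast = nn.Embedding(num_node_types, d_model)

    def forward(self, token_ids: torch.Tensor, node_type_ids: torch.Tensor) -> torch.Tensor:
        # Both inputs are (batch, seq_len); summing the two embeddings lets
        # structural signals reach every transformer layer.
        return self.tok(token_ids) + self.ast(node_type_ids)
```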
### Syntax-Constrained Decoding

Improves generation accuracy and reduces syntactically invalid output by:

- Restricting token outputs with grammar constraints (BNF/PEG); see the sketch below
- Custom decoding logic (e.g., tree traversal)
- Dynamic decoding masks driven by the current parser state

Inspired by: TreeGen, Code4Struct
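As an illustration, the masking could be implemented as a Hugging Face `LogitsProcessor` that suppresses grammar-violating tokens at each step. Only the `LogitsProcessor` interface below is real Transformers API; the `grammar` object and its `allowed_next_tokens` method are hypothetical stand-ins for an incremental BNF/PEG parser.

```python
import torch
from transformers import LogitsProcessor

class GrammarConstrainedLogits(LogitsProcessor):
    """Masks tokens the grammar does not allow as the next symbol."""

    def __init__(self, grammar):
        self.grammar = grammar  # hypothetical incremental parser

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        for i, prefix in enumerate(input_ids.tolist()):
            # The parser reports which token IDs may legally follow this prefix.
            allowed = self.grammar.allowed_next_tokens(prefix)
            mask[i, list(allowed)] = 0.0
        return scores + mask  # illegal tokens get -inf, legal ones keep their score
```

At generation time the processor would be passed to `model.generate(...)` via `logits_processor=LogitsProcessorList([GrammarConstrainedLogits(grammar)])`.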
### Multi-Task Learning Heads

Supports multiple tasks from a single backbone:

- Code generation (NL → code)
- Summarization (code → NL)
- Translation (Java ↔ Python)
- Code repair and completion

Inspired by: CodeT5+, CoTexT
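A lightweight way to get multi-task behavior from one encoder-decoder is CodeT5-style task prefixes, casting every task as text-to-text. The prefix strings below are illustrative design choices, not values fixed by CodeT5+ or the current codebase:

```python
# Illustrative task tags; the exact strings are a design choice.
TASK_PREFIXES = {
    "generate":  "generate python: ",
    "summarize": "summarize: ",
    "translate": "translate java to python: ",
    "repair":    "fix bug: ",
}

def format_example(task: str, source: str) -> str:
    """Prefixes the source so one seq2seq model can serve every task."""
    return TASK_PREFIXES[task] + source

print(format_example("summarize", "def add(a, b): return a + b"))
# -> "summarize: def add(a, b): return a + b"
```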
## LangChain + Ollama Integration

### Why?

To enable:

- Local testing and chaining of models via LangChain
- Fast prototyping with Ollama for custom transformer backends
- Easy switching between small local models and larger remote APIs
### Integration Plan

```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```
- Ollama serves the fine-tuned SLM locally
- LangChain wraps it with prompts, chains, and memory features for interactivity
## Datasets

| Dataset | Use |
|---|---|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |
## Training Objectives

- Span masking (CodeT5-style; sketched below)
- Contrastive pretraining
- Instruction tuning (natural prompt formatting)
- Autoregressive generation
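For reference, a simplified sketch of span masking in the T5/CodeT5 style; the sentinel format and hyperparameters are illustrative:

```python
import random

def span_mask(tokens, mask_ratio=0.15, mean_span=3):
    """Simplified T5/CodeT5-style span corruption.

    Contiguous spans are replaced by sentinel tokens in the encoder input;
    the decoder target reproduces each sentinel followed by the hidden span.
    """
    inp, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        # Start a span with probability mask_ratio / mean_span, so roughly
        # mask_ratio of all tokens end up masked in expectation.
        if random.random() < mask_ratio / mean_span:
            span = tokens[i : i + mean_span]
            sentinel = f"<extra_id_{sid}>"
            inp.append(sentinel)
            tgt.append(sentinel)
            tgt.extend(span)
            sid += 1
            i += len(span)
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

# Example:
# inp, tgt = span_mask("def add ( a , b ) : return a + b".split())
```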
## Evaluation Benchmarks

| Benchmark | Metric |
|---|---|
| HumanEval | pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, exact match (EM) |
| Unit tests | Pass rate |
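For HumanEval-style pass@k, the standard unbiased estimator from the Codex paper can be used (n samples per problem, c of which pass all unit tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex/HumanEval paper."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25
```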
## Project Roadmap

### Phase 1: MVP Model

- Train a TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via an Ollama + LangChain prompt chain

### Phase 2: Structural Learning

- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### Phase 3: Optimization & Packaging

- Distill from a larger model (e.g., StarCoder); see the loss sketch below
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face + Ollama integration
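For the distillation step, a common recipe blends a temperature-softened KL term against the teacher's logits with the usual cross-entropy on ground-truth tokens. A minimal sketch, assuming the student and teacher (e.g., StarCoder) share a vocabulary; `T` and `alpha` are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL against the teacher, blended with hard-label CE.

    T softens both distributions; alpha trades imitation vs. ground truth.
    """
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to be independent of T
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + (1 - alpha) * ce
```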
## Tools & Frameworks

- Hugging Face Transformers
- LangChain
- Ollama
- SentencePiece / BPE
- NetworkX for AST/CFG graph representation
## Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!

## License

MIT License. Built for research and open experimentation.

## Contact

Open an issue or start a discussion on GitHub!