---
library_name: transformers
tags:
- tokenizer
- code
- multilingual
- programming
license: apache-2.0
base_model:
- openai-community/gpt2
---

# CodeSearchNet Multilingual Tokenizer

A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.

## Model Details

### Model Description

This tokenizer is based on GPT-2's tokenizer but retrained specifically for source code across multiple programming languages. It provides more efficient tokenization for code compared to general-purpose tokenizers.

- **Model type:** BPE tokenizer
- **Languages:** Python, Java, JavaScript, PHP, Ruby, Go
- **Vocabulary size:** 64,000 tokens
- **Finetuned from:** GPT-2 tokenizer

## Uses

### Direct Use

This tokenizer is designed for preprocessing source code before training or inference with language models. It's particularly useful for:

- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants

## Performance

Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves:

- **Python**: 25% fewer tokens on average
- **Java**: 31% fewer tokens on average
- **JavaScript**: 21% fewer tokens on average
- **Go**: 14% fewer tokens on average
- **PHP**: 14% fewer tokens on average
- **Ruby**: 13% fewer tokens on average

A minimal snippet for comparing token counts on your own code is sketched at the end of this card.

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage: tokenize a Java snippet
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)      # subword strings
token_ids = tokenizer.encode(code)     # integer token IDs
```

## Training Details

### Training Data

Trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet), which contains:

- ~2M code functions across 6 programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation

### Training Procedure

- **Base model:** GPT-2 tokenizer (50,257-token vocabulary)
- **Training method:** BPE (Byte-Pair Encoding)
- **Final vocabulary:** 64,000 tokens
- **Training corpus:** Combined functions from all 6 languages in CodeSearchNet

A retraining sketch is included at the end of this card.

## Technical Specifications

### Model Architecture

- **Algorithm:** Byte-Pair Encoding (BPE)
- **Vocabulary size:** 64,000
- **Special tokens:** Inherited from the GPT-2 tokenizer
- **Subword handling:** Optimized for code syntax and patterns

## Citation

```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub},
}
```

## Dataset Reference

```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```
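
## Example: Comparing Token Counts with GPT-2

The reductions reported under Performance can be sanity-checked on your own snippets. The sketch below assumes only the two tokenizer IDs already referenced in this card; exact counts will vary with the code you pass in.

```python
from transformers import AutoTokenizer

# Base GPT-2 tokenizer vs. the code-specific tokenizer from this card.
gpt2_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
code_tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

snippet = '''def add(a, b):
    return a + b
'''

# Fewer tokens for the same code means lower cost and a longer effective context.
print("GPT-2 tokens:         ", len(gpt2_tokenizer.encode(snippet)))
print("Code tokenizer tokens:", len(code_tokenizer.encode(snippet)))
```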
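
## Example: Retraining Sketch

A tokenizer like this one can be retrained from the GPT-2 tokenizer with `train_new_from_iterator`. The sketch below streams functions from the Hugging Face `code_search_net` dataset for the six languages listed above; the exact corpus assembly and preprocessing used for this tokenizer may differ.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# The six CodeSearchNet languages covered by this tokenizer.
LANGUAGES = ["python", "java", "javascript", "php", "ruby", "go"]

def code_iterator(batch_size=1000):
    """Yield batches of raw function strings from each language's training split."""
    for lang in LANGUAGES:
        ds = load_dataset("code_search_net", lang, split="train", trust_remote_code=True)
        for start in range(0, len(ds), batch_size):
            yield ds[start : start + batch_size]["func_code_string"]

# Start from GPT-2's byte-level BPE and learn a new 64,000-token vocabulary.
base_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
new_tokenizer = base_tokenizer.train_new_from_iterator(code_iterator(), vocab_size=64_000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```

This mirrors the procedure stated above (GPT-2 base, BPE, 64,000-token vocabulary) but is an illustrative reconstruction rather than the exact training script.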