---
library_name: transformers
tags:
- tokenizer
- code
- multilingual
- programming
license: apache-2.0
base_model:
- openai-community/gpt2
---

# CodeSearchNet Multilingual Tokenizer

A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.

## Model Details

### Model Description

This tokenizer is based on GPT-2's tokenizer but retrained specifically for source code across multiple programming languages. It provides more efficient tokenization for code compared to general-purpose tokenizers.

- **Model type:** BPE tokenizer
- **Languages:** Python, Java, JavaScript, PHP, Ruby, Go
- **Vocabulary size:** 64,000 tokens
- **Finetuned from:** GPT-2 tokenizer

## Uses

### Direct Use

This tokenizer is designed for preprocessing source code before training or inference with language models. It's particularly useful for:

- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants

## Performance

Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves:

- **Python**: 25% fewer tokens on average
- **Java**: 31% fewer tokens on average
- **JavaScript**: 21% fewer tokens on average
- **Go**: 14% fewer tokens on average
- **PHP**: 14% fewer tokens on average
- **Ruby**: 13% fewer tokens on average

A minimal snippet for comparing token counts on your own code is sketched at the end of this card.

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage: tokenize a Java snippet
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)      # subword strings
token_ids = tokenizer.encode(code)     # integer token IDs
```

## Training Details

### Training Data

Trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet), which contains:

- ~2M code functions across 6 programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation

### Training Procedure

- **Base model:** GPT-2 tokenizer (50,257-token vocabulary)
- **Training method:** BPE (Byte-Pair Encoding)
- **Final vocabulary:** 64,000 tokens
- **Training corpus:** Combined functions from all 6 languages in CodeSearchNet

A retraining sketch is included at the end of this card.

## Technical Specifications

### Model Architecture

- **Algorithm:** Byte-Pair Encoding (BPE)
- **Vocabulary size:** 64,000
- **Special tokens:** Inherited from the GPT-2 tokenizer
- **Subword handling:** Optimized for code syntax and patterns

## Citation

```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub},
}
```

## Dataset Reference

```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```
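
## Example: Comparing Token Counts with GPT-2

The reductions reported under Performance can be sanity-checked on your own snippets. The sketch below assumes only the two tokenizer IDs already referenced in this card; exact counts will vary with the code you pass in.

```python
from transformers import AutoTokenizer

# Base GPT-2 tokenizer vs. the code-specific tokenizer from this card.
gpt2_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
code_tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

snippet = '''def add(a, b):
    return a + b
'''

# Fewer tokens for the same code means lower cost and a longer effective context.
print("GPT-2 tokens:         ", len(gpt2_tokenizer.encode(snippet)))
print("Code tokenizer tokens:", len(code_tokenizer.encode(snippet)))
```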
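
## Example: Retraining Sketch

A tokenizer like this one can be retrained from the GPT-2 tokenizer with `train_new_from_iterator`. The sketch below streams functions from the Hugging Face `code_search_net` dataset for the six languages listed above; the exact corpus assembly and preprocessing used for this tokenizer may differ.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# The six CodeSearchNet languages covered by this tokenizer.
LANGUAGES = ["python", "java", "javascript", "php", "ruby", "go"]

def code_iterator(batch_size=1000):
    """Yield batches of raw function strings from each language's training split."""
    for lang in LANGUAGES:
        ds = load_dataset("code_search_net", lang, split="train", trust_remote_code=True)
        for start in range(0, len(ds), batch_size):
            yield ds[start : start + batch_size]["func_code_string"]

# Start from GPT-2's byte-level BPE and learn a new 64,000-token vocabulary.
base_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
new_tokenizer = base_tokenizer.train_new_from_iterator(code_iterator(), vocab_size=64_000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```

This mirrors the procedure stated above (GPT-2 base, BPE, 64,000-token vocabulary) but is an illustrative reconstruction rather than the exact training script.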