---
license: apache-2.0
base_model:
- Zyphra/Zamba2-1.2B
library_name: transformers
---

# Model Card for Zamba2-1.2B-Instruct-v2

Zamba2-1.2B-Instruct-v2 is derived from the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model through SFT and DPO training on instruction-following and conversational datasets.

Zamba2-1.2B-Instruct-v2 is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks.

## Quick start

### Prerequisites

To use Zamba2-1.2B-Instruct-v2, install `transformers`:

`pip install transformers -U`

To install the dependencies needed to run the Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch), as well as `causal-conv1d`:

1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is **not** recommended, as it results in significantly higher latency and memory usage.

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Instantiate model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-Instruct-v2")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-Instruct-v2", device_map="cuda", torch_dtype=torch.bfloat16)

# Format the input as a chat template
prompt = "What factors contributed to the fall of the Roman Empire?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)

# Tokenize input and generate output
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
print(tokenizer.decode(outputs[0]))
```

## Performance

Zamba2-1.2B-Instruct-v2 achieves leading instruction-following performance for a model of its size and surpasses models of significantly larger size. For instance, Zamba2-1.2B-Instruct-v2 outperforms Gemma-2-2B-Instruct, a very strong model over 2x its size.

| Model | Size (B) | IFEval | BBH | GPQA | MATH (Hard) | MMLU Pro | MUSR | Aggregate |
|:-------|:------:|:--------:|:-----:|:------:|:-----------:|:----------:|:------:|:-----------:|
| Zamba2-1.2B-Instruct-v2 | 1.22 | 66.51 | 15.33 | 1.09 | 3.59 | 12.89 | 1.59 | 16.83 |
| Zamba2-1.2B-Instruct | 1.22 | 41.76 | 17.49 | 1.73 | 2.75 | 14.69 | 2.44 | 13.48 |
| Gemma-2-2b-it | 2.51 | 19.76 | 24.42 | 2.58 | 1.04 | 25.80 | 7.16 | 13.46 |
| SmolLM2-1.7B-Instruct | 1.71 | 53.00 | 18.30 | 3.51 | 4.89 | 20.51 | 4.53 | 17.46 |
| Qwen-2.5-1.5B-Instruct | 1.54 | 43.74 | 24.72 | 0.80 | 19.11 | 27.23 | 4.45 | 20.01 |
| Llama-3.2-1B-Instruct | 1.24 | 56.88 | 16.65 | 2.03 | 6.85 | 17.79 | 1.68 | 16.98 |

Due to its hybrid SSM-transformer architecture, Zamba2-1.2B-Instruct-v2 achieves very low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.
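
If you want to check the latency and memory behavior on your own hardware, the sketch below times a single greedy generation and reports throughput and peak GPU memory. The prompt, generation length, and timing approach are illustrative assumptions rather than an official benchmark; actual numbers will depend on your GPU, driver, and whether the Mamba2 kernels are installed.

```python
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Zyphra/Zamba2-1.2B-Instruct-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", torch_dtype=torch.bfloat16)

# Arbitrary example prompt (an assumption for illustration)
prompt = "Summarize the causes of World War I in three sentences."
chat = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
inputs = tokenizer(chat, return_tensors="pt", add_special_tokens=False).to("cuda")

# Warm-up run so one-time kernel/cache setup does not skew the timing
model.generate(**inputs, max_new_tokens=8, do_sample=False)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False, use_cache=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```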