---
license: apache-2.0
tags:
- text-generation
- llama.cpp
- gguf
- quantization
- merged-model
language:
- en
library_name: gguf
---

# merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct - GGUF Quantized Model

This is a collection of GGUF quantized versions of [pravdin/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct](https://huggingface.co/pravdin/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct).

## 🌳 Model Tree

This model was created by merging the following models:

```
pravdin/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct
├── Merge Method: dare_ties
├── context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16
└── unsloth/Llama-3.2-3B-Instruct
    ├── density: 0.6
    └── weight: 0.5
```

**Merge Method**: DARE_TIES, a merging technique that drops and rescales parameter deltas (DARE) and resolves sign conflicts (TIES) to reduce interference between the source models.

## 📊 Available Quantization Formats

This repository contains multiple quantization formats optimized for different use cases:

- **q4_k_m**: 4-bit quantization, medium quality, good balance of size and performance
- **q5_k_m**: 5-bit quantization, higher quality, slightly larger size
- **q8_0**: 8-bit quantization, highest quality and largest size, minimal quality loss relative to the original weights

## 🚀 Usage

### With llama.cpp

```bash
# Download a specific quantization
wget https://huggingface.co/pravdin/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct/resolve/main/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct.q4_k_m.gguf

# Run with llama.cpp (newer builds name this binary llama-cli instead of main)
./main -m merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct.q4_k_m.gguf -p "Your prompt here"
```

### With Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the model
llm = Llama(model_path="merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct.q4_k_m.gguf")

# Generate text
output = llm("Your prompt here", max_tokens=512)
print(output['choices'][0]['text'])
```

### With Ollama

```bash
# Create a Modelfile
echo 'FROM ./merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct.q4_k_m.gguf' > Modelfile

# Create and run the model
ollama create merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct -f Modelfile
ollama run merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct "Your prompt here"
```

## 📋 Model Details

- **Original Model**: [pravdin/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct](https://huggingface.co/pravdin/merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct)
- **Quantization Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
- **License**: Same as the original model
- **Use Cases**: Optimized for local inference, edge deployment, and resource-constrained environments

## 🎯 Recommended Usage

- **q4_k_m**: Best for most use cases, good quality/size trade-off
- **q5_k_m**: When you need higher quality and have more storage/memory available
- **q8_0**: When you want minimal quality loss compared to the original model

## ⚡ Performance Notes

GGUF models are optimized for:

- Faster loading times
- Lower memory usage
- CPU and GPU inference
- Cross-platform compatibility

For best performance, ensure your hardware supports the quantization format you choose.
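As a concrete illustration of these notes, the sketch below loads one of the GGUF files with llama-cpp-python and highlights the loading parameters that most affect memory use and GPU offload. The specific values for `n_gpu_layers`, `n_ctx`, and `n_threads` are illustrative assumptions to adapt to your hardware, not tuned recommendations for this model.

```python
from llama_cpp import Llama

# Illustrative settings -- adjust for your own hardware (values are assumptions, not benchmarks).
llm = Llama(
    model_path="merged-context-labs-meta-llama-Llama-3.2-3B-Instruct-FP16-unsloth-Llama-3.2-3B-Instruct.q4_k_m.gguf",
    n_gpu_layers=-1,  # offload all layers when a GPU backend (CUDA/Metal) is available; use 0 for CPU-only
    n_ctx=4096,       # context window; larger values increase memory usage
    n_threads=8,      # CPU threads for any work not offloaded to the GPU
)

output = llm("Your prompt here", max_tokens=256)
print(output["choices"][0]["text"])
```

If the model does not fit on constrained hardware, reducing the context window or switching to a lower-bit quantization is usually the quickest remedy.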

---

*This model was automatically quantized using the Lemuru LLM toolkit.*