speakleash
/

Bielik-11B-v2.6-Instruct-FP8-Dynamic

@@ -9,15 +9,15 @@ tags:
 - 8bit
 inference: false
 pipeline_tag: text-generation
-base_model: speakleash/Bielik-11B-v2.5-Instruct
 ---
 <p align="center">
   <img src="https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1-GGUF/raw/main/speakleash_cyfronet.png">
 </p>
-# Bielik-11B-v2.5-Instruct-FP8-Dynamic
-This model was obtained by quantizing the weights and activations of [Bielik-11B-v2.5-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.5-Instruct) to FP8 data type, ready for inference with vLLM >= 0.5.0 or SGLang.
 AutoFP8 is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
@@ -33,7 +33,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
-model_id = "speakleash/Bielik-11B-v2.5-Instruct-FP8-Dynamic"
 sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)
@@ -61,7 +61,7 @@ vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://do
 Launch a server of SGLang Runtime:
 ```
-python -m sglang.launch_server --model-path speakleash/Bielik-11B-v2.5-Instruct-FP8-Dynamic --port 30000
 ```
 Then you can send http request or use OpenAI Compatible API.
@@ -89,7 +89,7 @@ print(response)
 * **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)
 * **Language:** Polish
 * **Model type:** causal decoder-only
-* **Quant from:** [Bielik-11B-v2.5-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.5-Instruct)
 * **Finetuned from:** [Bielik-11B-v2](https://huggingface.co/speakleash/Bielik-11B-v2)
 * **License:** Apache 2.0 and [Terms of Use](https://bielik.ai/terms/)

 - 8bit
 inference: false
 pipeline_tag: text-generation
+base_model: speakleash/Bielik-11B-v2.6-Instruct
 ---
 <p align="center">
   <img src="https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1-GGUF/raw/main/speakleash_cyfronet.png">
 </p>
+# Bielik-11B-v2.6-Instruct-FP8-Dynamic
+This model was obtained by quantizing the weights and activations of [Bielik-11B-v2.6-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.6-Instruct) to FP8 data type, ready for inference with vLLM >= 0.5.0 or SGLang.
 AutoFP8 is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
+model_id = "speakleash/Bielik-11B-v2.6-Instruct-FP8-Dynamic"
 sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)
 Launch a server of SGLang Runtime:
 ```
+python -m sglang.launch_server --model-path speakleash/Bielik-11B-v2.6-Instruct-FP8-Dynamic --port 30000
 ```
 Then you can send http request or use OpenAI Compatible API.
 * **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)
 * **Language:** Polish
 * **Model type:** causal decoder-only
+* **Quant from:** [Bielik-11B-v2.6-Instruct](https://huggingface.co/speakleash/Bielik-11B-v2.6-Instruct)
 * **Finetuned from:** [Bielik-11B-v2](https://huggingface.co/speakleash/Bielik-11B-v2)
 * **License:** Apache 2.0 and [Terms of Use](https://bielik.ai/terms/)