# DESUCLUB/Llama-3.1-8B-Instruct-bf16-quantized.w8a8

This is a custom W8A16 quantized version of meta-llama/Llama-3.1-8B-Instruct.

## Quantization Details

- Method: Custom W8A16 (8-bit weights, 16-bit activations)
- Weight precision: INT8
- Scale precision: BF16
- Quantization: symmetric, per-channel (see the sketch after this list)
- Zero points: none (symmetric quantization uses no offset)
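
For reference, here is a minimal sketch of what symmetric per-channel INT8 quantization with BF16 scales typically looks like. The helper `quantize_w8a16` is hypothetical and is not code shipped with this repo:

```python
import torch

def quantize_w8a16(weight: torch.Tensor):
    """Symmetric per-channel INT8 quantization with BF16 scales (sketch)."""
    # One scale per output channel (row); symmetric, so no zero point.
    scale = weight.abs().amax(dim=-1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)  # guard against all-zero rows
    q = torch.round(weight / scale).clamp(-128, 127).to(torch.int8)
    return q, scale.to(torch.bfloat16)
```

Dequantization is then simply `q.to(torch.bfloat16) * scale`.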

## Model Structure

The quantized model contains:

- `.weight`: INT8 quantized weights
- `.weight_scale`: BF16 scale parameters (trainable); see the dequantization sketch after this list
- Standard embedding and normalization layers in the original precision
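
At inference time, a W8A16 linear layer rescales its INT8 weights with the per-channel BF16 scale and runs the matmul in 16-bit precision. A minimal sketch, assuming this layout; the function name and signature below are illustrative, not this repo's actual loader:

```python
import torch
import torch.nn.functional as F

def w8a16_linear(x, weight_int8, weight_scale, bias=None):
    # Dequantize: INT8 weights * per-output-channel BF16 scales
    w = weight_int8.to(torch.bfloat16) * weight_scale
    # Activations stay in 16-bit precision (the "A16" in W8A16)
    return F.linear(x.to(torch.bfloat16), w, bias)
```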

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: this checkpoint stores custom INT8 weights and BF16 scales,
# so it requires custom quantization code to load properly.
model = AutoModelForCausalLM.from_pretrained("DESUCLUB/Llama-3.1-8B-Instruct-bf16-quantized.w8a8")
tokenizer = AutoTokenizer.from_pretrained("DESUCLUB/Llama-3.1-8B-Instruct-bf16-quantized.w8a8")
```
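
Since stock `transformers` has no handler for this custom format, you may need to dequantize tensors manually. A hedged sketch using `safetensors`; the shard filename and tensor keys below are assumptions based on the standard HF Llama layout plus the `.weight_scale` suffix described above:

```python
import torch
from safetensors.torch import load_file

# Load one checkpoint shard directly (an 8B model is typically sharded;
# the exact filename is an assumption).
state = load_file("model-00001-of-00004.safetensors")

w_int8 = state["model.layers.0.self_attn.q_proj.weight"]       # INT8
scale = state["model.layers.0.self_attn.q_proj.weight_scale"]  # BF16
w_bf16 = w_int8.to(torch.bfloat16) * scale                     # dequantized
```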