Spaces:

Ahmadzei
/

RAG

Runtime error

RAG

File size: 797 Bytes

5fa1a76

If you want to load these other weights in a different format, use the torch_dtype parameter:

from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

AWQ quantization can also be combined with FlashAttention-2 to further accelerate inference:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")

Fused modules
Fused modules offers improved accuracy and performance and it is supported out-of-the-box for AWQ modules for Llama and Mistral architectures, but you can also fuse AWQ modules for unsupported architectures.