If you want to load these weights in a different format, use the `torch_dtype` parameter:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
```

AWQ quantization can also be combined with FlashAttention-2 to further accelerate inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)
```

## Fused modules

Fused modules offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules for the Llama and Mistral architectures, but you can also fuse AWQ modules for unsupported architectures.
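
For a supported architecture such as Mistral (which the Zephyr checkpoint above is based on), a minimal sketch of enabling fused modules might look like the following. It assumes an `AwqConfig` that accepts `do_fuse` and `fuse_max_seq_len`; adjust to the API of your Transformers version.

```python
# Minimal sketch: fused AWQ modules for a supported (Mistral-based) architecture.
# Assumes AwqConfig exposes do_fuse and fuse_max_seq_len.
from transformers import AutoModelForCausalLM, AwqConfig

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # maximum sequence length expected at generation time
    do_fuse=True,          # fuse attention/MLP layers into optimized kernels
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    quantization_config=quantization_config,
    device_map="cuda:0",
)
```

For unsupported architectures, the same config can be extended with a custom fusing mapping instead of relying on the built-in defaults.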