If you want to load these other weights in a different data type, use the torch_dtype parameter:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
AWQ quantization can also be combined with FlashAttention-2 to further accelerate inference:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)
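Since AutoTokenizer is already imported above, a short generation example can round out the snippet. This is only an illustrative sketch; the prompt text and max_new_tokens value are arbitrary:

# Tokenize a prompt, move it to the same device as the model, and generate
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
inputs = tokenizer("What is AWQ quantization?", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))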
Fused modules
Fused modules offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules in the Llama and Mistral architectures, but you can also fuse AWQ modules for architectures that aren't supported out-of-the-box.
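For a supported architecture, fusing is typically enabled through the quantization config passed at load time. The following is a minimal sketch, assuming your transformers version exposes the do_fuse and fuse_max_seq_len fields on AwqConfig; the checkpoint is reused from the examples above and the sequence length is chosen for illustration:

from transformers import AutoModelForCausalLM, AwqConfig

# do_fuse=True fuses the AWQ modules of supported architectures (Llama, Mistral);
# fuse_max_seq_len should cover the longest expected context, including generated tokens
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    quantization_config=quantization_config,
    device_map="cuda:0",
)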