To boost inference speed even further, use the ExLlamaV2 kernels by configuring the `exllama_config` parameter:
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained(
    "{your_username}/opt-125m-gptq",
    device_map="auto",
    quantization_config=gptq_config,
)
```
Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT. |
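For the PEFT finetuning case, a minimal sketch of deactivating the kernels is shown below; it assumes a recent Transformers release where `GPTQConfig` exposes a `use_exllama` flag, and the repository id is a placeholder:

```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

# Turn off the ExLlama kernels before attaching PEFT adapters;
# use_exllama=False falls back to the standard GPTQ CUDA kernels.
gptq_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained(
    "{your_username}/opt-125m-gptq",  # placeholder repository id
    device_map="auto",
    quantization_config=gptq_config,
)
```

The model loaded this way can then be passed to PEFT utilities as usual.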