input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
Using Flash Attention 2 | |
Flash Attention 2 is a faster, optimized implementation of the model's attention computation.
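As a rough sketch of how this is typically enabled: Flash Attention 2 needs the separate flash-attn package, a supported GPU, and half-precision weights (float16 or bfloat16). The helper names `pick_attn_implementation` and `load_model` below are hypothetical, as is the `model_id` argument, which stands in for whichever checkpoint you are loading. The fallback to "eager" attention is an assumption made for illustration, not something this page prescribes.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Choose a value for the `attn_implementation` argument (hypothetical helper)."""
    # Flash Attention 2 ships in the separate `flash_attn` package.
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_installed:
        return "flash_attention_2"
    # Assumed fallback: the plain attention implementation.
    return "eager"


def load_model(model_id: str):
    """Load a causal LM, using Flash Attention 2 when the environment allows it."""
    # Imported lazily so the helper above works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    attn = pick_attn_implementation(torch.cuda.is_available())
    # Flash Attention 2 only supports half-precision dtypes.
    dtype = torch.float16 if attn == "flash_attention_2" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=dtype,
        attn_implementation=attn,
    )
    if attn == "flash_attention_2":
        model = model.to("cuda")
    return model
```

Calling `load_model` with your checkpoint name returns a model that can be passed to the `generate` call shown earlier. When the model sits on the GPU, `input_ids` must be moved to the same device first.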