metadata
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
  - long-sequence-generation
  - ultra-long-sequence
  - lossless-acceleration

Model Card for TokenSwift Models

TokenSwift is a framework that substantially accelerates the generation of ultra-long sequences (up to 100K tokens) while preserving the target model's inherent quality, so the acceleration is lossless. It is based on the research presented in "From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens".

Model Details

TokenSwift is a framework for accelerating ultra-long sequence generation. It is designed to be compatible with various Large Language Models (LLMs) via the Hugging Face Transformers library.

  • Developed by: BIGAI-NLCO
  • Model type: Long Sequence Generation Accelerator
  • Language(s) (NLP): Multiple (supports models trained on various languages)
  • License: Apache-2.0
  • Finetuned from model: Varies with the base model used

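Model Sources

  • Paper: From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens (https://arxiv.org/abs/2502.18890)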

Uses

TokenSwift accelerates the generation of ultra-long sequences (up to 100K tokens) for various LLMs. Its key benefit is lossless acceleration: the quality of the generated text matches that of the underlying LLM.

Direct Use

TokenSwift acts as a wrapper around existing LLMs, speeding up their generation process. It's designed to be easily integrated into existing workflows. See examples below.

How to Get Started with the Model

  1. Install: Follow the installation instructions in the GitHub repository.
  2. Download: Choose a pre-trained TokenSwift model from the Hugging Face Model Hub ([link to models on HuggingFace]).
  3. Inference: Use the provided inference script in the repository, adapting the parameters to your needs. See the GitHub README for detailed instructions and examples.

Example using transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "TokenSwift/TokenSwift-Llama-3.1-8B" # Replace with the actual model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

prompt = "The key to success is"
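# Move the inputs to the model's device instead of assuming CUDA;
# with device_map="auto" the model may not sit on the default GPU.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,  # passes input_ids together with the attention mask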
    max_new_tokens=100,
    # Add other generation parameters as needed
)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
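
Because the point of TokenSwift is faster generation, it can be useful to time a generation call and report tokens per second. The following is a minimal sketch, not part of TokenSwift: `measure_throughput` and `fake_generate` are hypothetical names introduced here, and the stand-in generator only simulates a call that returns the prompt followed by the new tokens.

```python
import time


def measure_throughput(generate_fn, prompt_len):
    """Time one call to generate_fn and report new tokens per second.

    generate_fn is any zero-argument callable returning the full token
    sequence (prompt plus continuation) as a 1-D sequence.
    """
    start = time.perf_counter()
    sequence = generate_fn()
    elapsed = time.perf_counter() - start

    new_tokens = len(sequence) - prompt_len
    tokens_per_second = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, elapsed, tokens_per_second


def fake_generate():
    """Stand-in for a real generation call (assumption for illustration)."""
    time.sleep(0.01)
    return list(range(8 + 100))  # 8 prompt tokens followed by 100 new tokens


new_tokens, seconds, tps = measure_throughput(fake_generate, prompt_len=8)
print(f"{new_tokens} new tokens in {seconds:.3f}s ({tps:.1f} tokens/s)")
```

To time a real model, pass a lambda that wraps the `model.generate(...)` call and returns the first row of its output, with `prompt_len` set to the number of prompt tokens. On a GPU, calling `torch.cuda.synchronize()` before reading the clock helps ensure that pending work is included in the measurement.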

Example using TokenSwift command line:

torchrun --master-port 1111 --nproc_per_node=1 main.py \
    --model_type llama3_1 \
    --ckpt_path your_checkpoint_path \
    --prefill_len 4096 \
    --retrival_max_budget 4096 \
    --gen_len 102400 \
    --gamma 4 \
    --min_p 0.1 \
    --temperature 1.0 \
    --tree_decoding \
    --ngram_topk 20 \
    --penalty 1.2 \
    --penalty_length 1024 \
    --prompt_id 0

Note: Modify the data and model paths to match your setup.
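
When sweeping parameters such as `--gen_len`, it may be convenient to assemble the command above programmatically. The sketch below only builds the argument list; it does not launch anything. The flag names and values are copied from the command shown above, while `TOKENSWIFT_ARGS` and `build_command` are hypothetical names introduced for illustration. It assumes `--tree_decoding` is a boolean switch, since the command passes it without a value.

```python
import shlex

# Flag values copied from the command-line example; adjust to your setup.
TOKENSWIFT_ARGS = {
    "model_type": "llama3_1",
    "ckpt_path": "your_checkpoint_path",
    "prefill_len": 4096,
    "retrival_max_budget": 4096,  # spelling follows the flag as shown
    "gen_len": 102400,
    "gamma": 4,
    "min_p": 0.1,
    "temperature": 1.0,
    "tree_decoding": True,  # assumed boolean switch, emitted without a value
    "ngram_topk": 20,
    "penalty": 1.2,
    "penalty_length": 1024,
    "prompt_id": 0,
}


def build_command(args, master_port=1111, nproc_per_node=1, script="main.py"):
    """Return the torchrun invocation as a list of strings."""
    cmd = [
        "torchrun",
        "--master-port", str(master_port),
        f"--nproc_per_node={nproc_per_node}",
        script,
    ]
    for name, value in args.items():
        if value is True:
            cmd.append(f"--{name}")  # switch: flag name only
        elif value is False or value is None:
            continue  # disabled switch or unset option: omit it
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


# Example: the same run with a shorter generation length.
short_run = {**TOKENSWIFT_ARGS, "gen_len": 20480}
print(shlex.join(build_command(short_run)))
```

The resulting list can be passed to `subprocess.run` from the directory that contains `main.py`.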

Citation

BibTeX:

@misc{tokenswift,
      title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens}, 
      author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
      year={2025},
      eprint={2502.18890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18890}, 
}

APA:

Wu, T., Shen, J., Jia, Z., Wang, Y., & Zheng, Z. (2025). From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens. arXiv preprint arXiv:2502.18890.