metadata

library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
  - long-sequence-generation
  - ultra-long-sequence
  - lossless-acceleration

Model Card for TokenSwift Models

TokenSwift is a framework that accelerates the generation of ultra-long sequences (up to 100K tokens) while preserving the target model's inherent quality. The acceleration is lossless: generation time drops substantially, but the output quality stays that of the target model. It is based on the research presented in From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens.

Model Details

TokenSwift is a framework for accelerating ultra-long sequence generation. It is designed to be compatible with various Large Language Models (LLMs) through the Hugging Face Transformers library.

Developed by: BIGAI-NLCO
Model type: Long Sequence Generation Accelerator
Language(s) (NLP): Multiple (supports models trained on various languages)
License: Apache-2.0
Finetuned from model: Model-specific; varies with the base model used

Model Sources

Repository: https://github.com/bigai-nlco/TokenSwift
Paper: https://arxiv.org/abs/2502.18890

Uses

TokenSwift accelerates the generation of ultra-long sequences (up to 100K tokens) for various LLMs. Its key benefit is lossless acceleration: the quality of the generated text is identical to that of the underlying LLM.
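
To make "lossless" concrete, here is a minimal sketch of generic draft-then-verify decoding under greedy sampling. It is not TokenSwift's implementation; the paper describes the actual method. The toy models `target_next` and `draft_next` and the function `speculative_decode` are hypothetical names invented for this illustration. The `gamma` argument here only means "tokens drafted per step" in this toy and is not claimed to match the `--gamma` command-line flag.

```python
from typing import Callable, List

# In this toy, a "model" is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Toy target model: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except when len(prefix) % 3 == 0."""
    guess = target_next(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, gamma: int = 4) -> List[int]:
    """Draft up to `gamma` tokens, then keep only what the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # 1. The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(gamma, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed token. A real system would do
        #    this in one batched forward pass; here it is a plain loop.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # the emitted token always comes from the target
            if tok != expected:
                break  # first mismatch: discard the rest of the proposal
    return out


prompt = [1, 2, 3]
reference = greedy_decode(target_next, prompt, 40)
fast = speculative_decode(target_next, draft_next, prompt, 40)
print(fast == reference)
```

Because every emitted token is the one the target model would have chosen, the accelerated output matches plain greedy decoding exactly. The draft model only affects how many tokens are accepted per step.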

Direct Use

TokenSwift acts as a wrapper around existing LLMs, speeding up their generation process. It is designed to integrate easily into existing workflows. See the examples under How to Get Started with the Model.

How to Get Started with the Model

Install: Follow the installation instructions in the GitHub repository.
Download: Choose a pre-trained TokenSwift model from the Hugging Face Model Hub ([link to models on HuggingFace]).
Inference: Use the provided inference script in the repository, adapting the parameters to your needs. See the GitHub README for detailed instructions and examples.

Example using transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "TokenSwift/TokenSwift-Llama-3.1-8B" # Replace with the actual model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

prompt = "The key to success is"
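# Move the inputs to the model's device; with device_map="auto" this is not always the default GPU
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,  # passes input_ids and attention_mask together
    max_new_tokens=100,
    # Add other generation parameters as needed
)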
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)

Example using TokenSwift command line:

torchrun --master-port 1111 --nproc_per_node=1 main.py \
    --model_type llama3_1 \
    --ckpt_path your_checkpoint_path \
    --prefill_len 4096 \
    --retrival_max_budget 4096 \
    --gen_len 102400 \
    --gamma 4 \
    --min_p 0.1 \
    --temperature 1.0 \
    --tree_decoding \
    --ngram_topk 20 \
    --penalty 1.2 \
    --penalty_length 1024 \
    --prompt_id 0

Note: Modify the data and model paths to match your setup.
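
If you launch runs from Python rather than a shell, the same command can be assembled from a dictionary of options. The sketch below is only a convenience wrapper: `build_command` is a hypothetical helper, not part of the TokenSwift repository. The flag names and values are copied from the command above, including the repository's spelling of `retrival_max_budget`.

```python
import shlex
from typing import Dict, List, Sequence, Union


def build_command(options: Dict[str, Union[str, int, float]],
                  flags: Sequence[str] = (),
                  nproc_per_node: int = 1,
                  master_port: int = 1111) -> List[str]:
    """Assemble the torchrun argument list (hypothetical helper)."""
    cmd = ["torchrun", "--master-port", str(master_port),
           f"--nproc_per_node={nproc_per_node}", "main.py"]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]       # options that take a value
    cmd += [f"--{flag}" for flag in flags]     # boolean switches
    return cmd


options = {
    "model_type": "llama3_1",
    "ckpt_path": "your_checkpoint_path",  # placeholder, replace with your path
    "prefill_len": 4096,
    "retrival_max_budget": 4096,
    "gen_len": 102400,
    "gamma": 4,
    "min_p": 0.1,
    "temperature": 1.0,
    "ngram_topk": 20,
    "penalty": 1.2,
    "penalty_length": 1024,
    "prompt_id": 0,
}
command = build_command(options, flags=["tree_decoding"])
print(shlex.join(command))  # shell-quoted form, useful for logging
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains `main.py`.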

Citation

BibTeX:

@misc{tokenswift,
      title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens}, 
      author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
      year={2025},
      eprint={2502.18890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18890}, 
}

APA:

Wu, T., Shen, J., Jia, Z., Wang, Y., & Zheng, Z. (2025). From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens. arXiv preprint arXiv:2502.18890.