TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

English | 中文简体

TextFlux is an OCR-free framework using a Diffusion Transformer (DiT, based on FLUX.1-Fill-dev) for high-fidelity multilingual scene text synthesis. It simplifies the learning task by providing direct visual glyph guidance through spatial concatenation of rendered glyphs with the scene image, enabling the model to focus on contextual reasoning and visual fusion.
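
As a rough illustration of the spatial concatenation described above, the sketch below stacks a rendered-glyph canvas on top of a scene image with NumPy. The stacking axis, the array shapes, and the helper name `concat_glyph_and_scene` are assumptions made for illustration; the repository's own preprocessing may lay the two images out differently.

```python
import numpy as np


def concat_glyph_and_scene(glyph: np.ndarray, scene: np.ndarray) -> np.ndarray:
    """Stack a glyph canvas above a scene image (both H x W x C arrays).

    Hypothetical helper: the vertical layout is an assumption, not
    necessarily the layout TextFlux uses.
    """
    if glyph.shape[1:] != scene.shape[1:]:
        raise ValueError("glyph and scene must share width and channel count")
    return np.concatenate([glyph, scene], axis=0)


# Toy inputs: a white glyph canvas and a black scene of the same size.
glyph = np.full((64, 128, 3), 255, dtype=np.uint8)
scene = np.zeros((64, 128, 3), dtype=np.uint8)
combined = concat_glyph_and_scene(glyph, scene)
print(combined.shape)  # (128, 128, 3)
```

The model then receives a single image in which the glyph reference and the region to be edited sit side by side, so no separate OCR encoder is needed to tell it what to write.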

Key Features

  • OCR-Free: Simplified architecture without OCR encoders.
  • High-Fidelity & Contextual Styles: Precise rendering, stylistically consistent with scenes.
  • Multilingual & Low-Resource: Strong performance across languages; adapts to new languages with minimal data (e.g., <1,000 samples).
  • Zero-Shot Generalization: Renders characters unseen during training.
  • Controllable Multi-Line Text: Flexible multi-line synthesis with line-level control.
  • Data Efficient: Uses a fraction of data (e.g., ~1%) compared to other methods.

Updates

TextFlux-beta

We are excited to release TextFlux-beta and TextFlux-LoRA-beta, new versions of our model specifically optimized for single-line text editing.

Key Advantages

  • Significantly improves the quality of single-line text rendering.
  • Increases inference speed for single-line text by approximately 1.4x.
  • Dramatically enhances the accuracy of small text synthesis.

How It Works

Considering that single-line editing is a primary use case for many users and generally yields more stable, high-quality results, we have released new weights optimized for this scenario.

Unlike the original model, which renders glyphs onto a full-size mask, the beta version uses a single-line image strip as the glyph condition. This reduces unnecessary computation and provides a more stable, higher-quality supervisory signal, which leads directly to the improvements in both single-line and small-text rendering (see example here).
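
To get a feel for why a strip-shaped glyph condition can reduce computation, the sketch below compares the area of a single-line strip with that of a full-size glyph canvas. The image sizes and the function name are hypothetical, and the area ratio is only a proxy: it does not by itself predict the measured speedup.

```python
def glyph_area_ratio(scene_w: int, scene_h: int, strip_h: int) -> float:
    """Return strip area divided by full-size canvas area.

    Assumes, for illustration only, that the strip spans the full scene width.
    """
    if not 0 < strip_h <= scene_h:
        raise ValueError("strip_h must be in (0, scene_h]")
    full_area = scene_w * scene_h
    strip_area = scene_w * strip_h
    return strip_area / full_area


# Hypothetical sizes: a 512x512 scene and a 64-pixel-tall glyph strip.
ratio = glyph_area_ratio(512, 512, 64)
print(f"strip covers {ratio:.1%} of the full-size glyph area")
```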

To use these new models, please refer to the updated files: demo.py, run_inference.py, and run_inference_lora.py. While the beta models retain the ability to generate multi-line text, we highly recommend using them for single-line tasks to achieve the best performance and stability.

Performance

The table below shows that TextFlux-beta improves SeqAcc-Editing on single-line text editing by approximately 11 points and runs about 1.4 times faster than the previous version. The AMO Sampler contributes approximately 3 points of this gain. All results are on the ReCTS editing test set.

Method               SeqAcc-Editing (%) ↑   NED (%) ↑   FID ↓   LPIPS ↓   Inference Speed (s/img) ↓
TextFlux-LoRA        37.2                   58.2        4.93    0.063     16.8
TextFlux             40.6                   60.7        4.84    0.062     15.6
TextFlux-LoRA-beta   48.4                   70.5        4.69    0.062     12.0
TextFlux-beta        51.5                   72.9        4.59    0.061     10.9
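
The headline figures can be checked directly against the table. The short script below recomputes the SeqAcc-Editing gain and the speedup from the reported numbers; the dictionary and function names are illustrative only.

```python
# Rows copied from the table above: (SeqAcc-Editing %, seconds per image).
results = {
    "TextFlux-LoRA": (37.2, 16.8),
    "TextFlux": (40.6, 15.6),
    "TextFlux-LoRA-beta": (48.4, 12.0),
    "TextFlux-beta": (51.5, 10.9),
}


def compare(base: str, beta: str) -> tuple[float, float]:
    """Return (SeqAcc gain in points, speedup factor) of beta over base."""
    base_acc, base_time = results[base]
    beta_acc, beta_time = results[beta]
    return round(beta_acc - base_acc, 1), round(base_time / beta_time, 2)


print(compare("TextFlux", "TextFlux-beta"))            # (10.9, 1.43)
print(compare("TextFlux-LoRA", "TextFlux-LoRA-beta"))  # (11.2, 1.4)
```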

Setup

  1. Clone/Download: Get the necessary code and model weights.

  2. Dependencies:

git clone https://github.com/yyyyyxie/textflux.git
cd textflux
conda create -n textflux python==3.11.4 -y
conda activate textflux
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install -r requirements.txt
cd diffusers
pip install -e .
# Ensure gradio == 3.50.1
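
The setup notes pin gradio to 3.50.1. A small check like the following can confirm the installed version before launching the demo. It is a sketch that relies only on the standard-library `importlib.metadata`; the helper name `gradio_version_ok` is hypothetical and not part of the repository.

```python
from importlib import metadata

REQUIRED_GRADIO = "3.50.1"  # version pinned in the setup notes above


def gradio_version_ok(required: str = REQUIRED_GRADIO) -> bool:
    """Return True only if gradio is installed at exactly `required`."""
    try:
        installed = metadata.version("gradio")
    except metadata.PackageNotFoundError:
        print("gradio is not installed")
        return False
    if installed != required:
        print(f"gradio {installed} found, expected {required}")
        return False
    return True


if __name__ == "__main__":
    print("gradio OK" if gradio_version_ok() else "fix gradio before running demo.py")
```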

Gradio Demo

Provides "Custom Mode" (upload scene image, draw masks, input text for automatic template generation) and "Normal Mode" (for pre-combined inputs).

# Ensure gradio == 3.50.1
python demo.py 

Training

This guide provides instructions for training and fine-tuning the TextFlux models.


Multi-line Training (Reproducing Paper Results)

Follow these steps to reproduce the multi-line text generation results from the original paper.

  1. Prepare the Dataset: Download the Multi-line dataset and organize it using the following directory structure:

    |- ./datasets
       |- multi-lingual
       |  |- processed_mlt2017
       |  |- processed_ReCTS_train_images
       |  |- processed_totaltext
       |  ....
    
  2. Run the Training Script: Execute the appropriate training script. The train.sh script is for standard training, while train_lora.sh is for training with LoRA.

    # For standard training
    bash scripts/train.sh
    

    or

    # For LoRA training
    bash scripts/train_lora.sh
    

    Note: Ensure you are using the commands and configurations within the script designated for multi-line training.
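
Before starting a long training run, it can help to confirm that the dataset folders are where the scripts expect them. The sketch below checks only the sub-folders named in the listing above; because that listing is elided, the real dataset may contain more folders, and the helper name `missing_subdirs` is hypothetical.

```python
from pathlib import Path

# Sub-folders named in the directory listing above; the listing is elided
# ("...."), so the real dataset may contain more folders than these.
EXPECTED_MULTILINE = [
    "processed_mlt2017",
    "processed_ReCTS_train_images",
    "processed_totaltext",
]


def missing_subdirs(root: Path, expected: list[str]) -> list[str]:
    """Return the names in `expected` that are not directories under `root`."""
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    root = Path("./datasets/multi-lingual")
    missing = missing_subdirs(root, EXPECTED_MULTILINE)
    print("layout OK" if not missing else f"missing under {root}: {missing}")
```

The same function works for any other dataset root by passing a different path and list of folder names.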


Single-line Training

To create the TextFlux-beta weights optimized for the single-line task, we fine-tuned our pre-trained multi-line models. Specifically, we loaded the weights from the TextFlux and TextFlux-LoRA models and continued training for an additional 10,000 steps on a single-line dataset.

If you wish to replicate this process, you can follow these steps:

  1. Prepare the Dataset: First, download the Single-line dataset and arrange it as follows:

    |- ./datasets
       |- anyword
       |  |- ReCTS
       |  |- TotalText
       |  |- ArT
       |  ...
       ....
    
  2. Run the Fine-tuning Script: Ensure your script is configured to load the weights from a pre-trained multi-line model, and then execute the fine-tuning command.

    # For standard fine-tuning
    bash scripts/train.sh
    

    or

    # For LoRA fine-tuning
    bash scripts/train_lora.sh
    

Evaluation

First, use the scripts/batch_eval.sh script to perform batch inference on the images in the test set.

bash scripts/batch_eval.sh

Once inference is complete, use eval/eval_ocr.sh to evaluate the OCR accuracy and eval/eval_fid_lpips.sh to evaluate FID and LPIPS scores.

bash eval/eval_ocr.sh
bash eval/eval_fid_lpips.sh
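
As a rough guide to what the OCR metrics in the Performance table measure, the sketch below computes sequence accuracy (exact match) and a normalized edit-distance similarity over recognized strings. The definition of NED used here (1 minus Levenshtein distance divided by the longer string's length, averaged over pairs) is a common convention and is an assumption; eval/eval_ocr.sh may compute these scores differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ocr_scores(preds: list[str], targets: list[str]) -> tuple[float, float]:
    """Return (sequence accuracy, mean NED similarity), both in [0, 1]."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(p == t for p, t in zip(preds, targets))
    sims = []
    for p, t in zip(preds, targets):
        longest = max(len(p), len(t))
        sims.append(1.0 if longest == 0 else 1 - levenshtein(p, t) / longest)
    return exact / len(preds), sum(sims) / len(sims)


# Toy example: one exact match and one single-character error.
seq_acc, ned = ocr_scores(["OPEN", "EXIS"], ["OPEN", "EXIT"])
print(seq_acc, ned)  # 0.5 0.875
```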

TODO

  • Release the training datasets and testing datasets
  • Release the training scripts
  • Release the eval scripts
  • Support ComfyUI

Acknowledgement

Our code is modified based on Diffusers. We adopt FLUX.1-Fill-dev as the base model. Thanks to all the contributors for the helpful discussions! We also sincerely thank the contributors of the following code repositories for their valuable contributions: AnyText, AMO.

Citation

@misc{xie2025textfluxocrfreeditmodel,
      title={TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis}, 
      author={Yu Xie and Jielei Zhang and Pengyu Chen and Ziyue Wang and Weihang Wang and Longwen Gao and Peiyi Li and Huyang Sun and Qiang Zhang and Qian Qiao and Jiaqing Fan and Zhouhui Lian},
      year={2025},
      eprint={2505.17778},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17778}, 
}