TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

TextFlux is an OCR-free framework using a Diffusion Transformer (DiT, based on FLUX.1-Fill-dev) for high-fidelity multilingual scene text synthesis. It simplifies the learning task by providing direct visual glyph guidance through spatial concatenation of rendered glyphs with the scene image, enabling the model to focus on contextual reasoning and visual fusion.
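
The spatial concatenation idea can be illustrated with a toy sketch. This is not the repository's implementation: images are plain 2D lists, a filled box stands in for real glyph rendering, and the helper names, the stacking axis (height), and the "255 means repaint" mask convention are all assumptions made for illustration.

```python
# Toy illustration only: images are 2D lists of grayscale ints (0-255).
# Stacking along the height axis and all helper names are assumptions.

def blank(height, width, value=0):
    """Create a height x width canvas filled with a single value."""
    return [[value] * width for _ in range(height)]


def paint_box(canvas, box, value):
    """Fill the rectangle box=(left, top, right, bottom) in place."""
    left, top, right, bottom = box
    for y in range(top, bottom):
        for x in range(left, right):
            canvas[y][x] = value
    return canvas


def build_condition(scene, text_box):
    """Return (image, mask) with a glyph canvas stacked above the scene.

    The glyph canvas marks where rendered text would go (a filled box stands
    in for real glyphs). The mask is 255 only over the scene region to be
    repainted, so the glyph half is kept as a visual reference.
    """
    height, width = len(scene), len(scene[0])
    glyph_canvas = paint_box(blank(height, width), text_box, 255)
    image = glyph_canvas + [row[:] for row in scene]
    scene_mask = paint_box(blank(height, width), text_box, 255)
    mask = blank(height, width) + scene_mask
    return image, mask


scene = blank(4, 6, value=128)
image, mask = build_condition(scene, text_box=(1, 1, 4, 3))
print(len(image), len(image[0]))  # combined canvas: 8 rows, 6 columns
```

Because the glyphs sit in the same canvas as the scene, the model can read the target text visually rather than through a separate OCR encoder, which is the simplification the description above refers to.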

Key Features

OCR-Free: Simplified architecture without OCR encoders.
High-Fidelity & Contextual Styles: Precise rendering, stylistically consistent with scenes.
Multilingual & Low-Resource: Strong performance across languages, adapts to new languages with minimal data (e.g., <1,000 samples).
Zero-Shot Generalization: Renders characters unseen during training.
Controllable Multi-Line Text: Flexible multi-line synthesis with line-level control.
Data Efficient: Uses a fraction of data (e.g., ~1%) compared to other methods.

Updates

2025/08/02: Our full-parameter TextFlux-beta weights and TextFlux-LoRA-beta weights are now available! They improve single-line text generation accuracy by 10.9 and 11.2 percentage points, respectively 👋!
2025/08/02: Our Training Datasets and Testing Datasets are now available 👋!
2025/08/01: Our Eval Scripts are now available 👋!
2025/05/27: Our Full-Param Weights and LoRA Weights are now available 👋!
2025/05/25: Our Paper on ArXiv is available 👋!

TextFlux-beta

We are excited to release TextFlux-beta and TextFlux-LoRA-beta, new versions of our model specifically optimized for single-line text editing.

Key Advantages

Significantly improves the quality of single-line text rendering.
Increases inference speed for single-line text by approximately 1.4x.
Dramatically enhances the accuracy of small text synthesis.

How It Works

Considering that single-line editing is a primary use case for many users and generally yields more stable, high-quality results, we have released new weights optimized for this scenario.

Unlike the original model, which renders glyphs onto a full-size mask, the beta version uses a single-line image strip as the glyph condition. This reduces unnecessary computation and provides a more stable, higher-quality supervisory signal, which leads directly to the improvements in single-line and small-text rendering (see example here).
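
To give a rough feel for why a strip is cheaper than a full-size glyph canvas, the sketch below counts image tokens for both layouts. It assumes roughly one token per 16x16 pixel patch (8x VAE downsampling followed by 2x2 patching, as in FLUX-style latents), vertical stacking of the glyph condition and the scene, and hypothetical image and strip sizes. None of these numbers come from the repository.

```python
PIXELS_PER_TOKEN_SIDE = 16  # assumption: 8x VAE downsampling, then 2x2 patching


def image_tokens(width, height):
    """Number of image tokens for a canvas whose sides are multiples of 16."""
    return (width // PIXELS_PER_TOKEN_SIDE) * (height // PIXELS_PER_TOKEN_SIDE)


def condition_tokens(scene_w, scene_h, glyph_h):
    """Tokens for a scene with a glyph canvas of height glyph_h stacked on it."""
    return image_tokens(scene_w, scene_h + glyph_h)


scene_w, scene_h = 512, 512  # hypothetical scene size
strip_h = 64                 # hypothetical single-line strip height

full = condition_tokens(scene_w, scene_h, glyph_h=scene_h)   # full-size glyph canvas
strip = condition_tokens(scene_w, scene_h, glyph_h=strip_h)  # single-line strip
print(f"full-size: {full} tokens, strip: {strip} tokens")
```

Token count is only one factor in runtime, so this ratio should not be read as a prediction of the measured speedup.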

To use these new models, please refer to the updated files: demo.py, run_inference.py, and run_inference_lora.py. While the beta models retain the ability to generate multi-line text, we highly recommend using them for single-line tasks to achieve the best performance and stability.

Performance

The table below shows that the beta models improve single-line text editing accuracy (SeqAcc-Editing) by approximately 11 points and run approximately 1.4 times faster than the previous versions. The AMO Sampler accounts for approximately 3 points of the accuracy gain. All results are measured on the ReCTS editing test set.

Method	SeqAcc-Editing (%)↑	NED (%)↑	FID ↓	LPIPS ↓	Inference Speed (s/img)↓
TextFlux-LoRA	37.2	58.2	4.93	0.063	16.8
TextFlux	40.6	60.7	4.84	0.062	15.6
TextFlux-LoRA-beta	48.4	70.5	4.69	0.062	12.0
TextFlux-beta	51.5	72.9	4.59	0.061	10.9
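
The headline figures can be re-derived from the table rows. The short script below copies the SeqAcc-Editing and inference-time columns and computes the gain in points and the speedup factor for each beta model against its predecessor; the `compare` helper is only an illustrative name.

```python
# Values copied from the table above: (SeqAcc-Editing %, seconds per image).
results = {
    "TextFlux-LoRA": (37.2, 16.8),
    "TextFlux": (40.6, 15.6),
    "TextFlux-LoRA-beta": (48.4, 12.0),
    "TextFlux-beta": (51.5, 10.9),
}


def compare(base, beta):
    """Return (accuracy gain in points, speedup factor) of beta over base."""
    base_acc, base_time = results[base]
    beta_acc, beta_time = results[beta]
    return round(beta_acc - base_acc, 1), round(base_time / beta_time, 2)


print(compare("TextFlux", "TextFlux-beta"))            # (10.9, 1.43)
print(compare("TextFlux-LoRA", "TextFlux-LoRA-beta"))  # (11.2, 1.4)
```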

Setup

Clone/Download: Get the necessary code and model weights.
Dependencies:

git clone https://github.com/yyyyyxie/textflux.git
cd textflux
conda create -n textflux python==3.11.4 -y
conda activate textflux
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install -r requirements.txt
cd diffusers
pip install -e .
# Ensure gradio == 3.50.1
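
Since the demo depends on an exact Gradio version, it can help to confirm the pin after installation. The following is a minimal sketch using only the standard library; `check_pin` is a hypothetical helper and is not part of the repository.

```python
from importlib import metadata


def check_pin(package, expected):
    """Return 'ok', 'mismatch', or 'missing' for an exact version pin."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else "mismatch"


# The pinned version comes from the setup notes above.
print("gradio:", check_pin("gradio", "3.50.1"))
```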

Gradio Demo

The demo provides a "Custom Mode" (upload a scene image, draw masks, and input text for automatic template generation) and a "Normal Mode" (for pre-combined inputs).

# Ensure gradio == 3.50.1
python demo.py

Training

This guide provides instructions for training and fine-tuning the TextFlux models.

Multi-line Training (Reproducing Paper Results)

Follow these steps to reproduce the multi-line text generation results from the original paper.

Prepare the Dataset: Download the Multi-line dataset and organize it using the following directory structure:

|- ./datasets
   |- multi-lingual
   |  |- processed_mlt2017
   |  |- processed_ReCTS_train_images
   |  |- processed_totaltext
   |  ....

Run the Training Script: Execute the appropriate training script. The train.sh script is for standard training, while train_lora.sh is for training with LoRA.
```
# For standard training
bash scripts/train.sh
```
or
```
# For LoRA training
bash scripts/train_lora.sh
```
Note: Ensure you are using the commands and configurations within the script designated for multi-line training.
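
Before launching a long training run, it may be worth checking that the dataset layout matches the structure shown above. The sketch below is a hypothetical helper, not part of the repository. It checks only the three subdirectories named explicitly in the layout, and the elided entries are deliberately left out.

```python
from pathlib import Path

# Subdirectories named in the layout above; the real dataset contains more.
EXPECTED_MULTILINE = [
    "multi-lingual/processed_mlt2017",
    "multi-lingual/processed_ReCTS_train_images",
    "multi-lingual/processed_totaltext",
]


def missing_dirs(root, expected=EXPECTED_MULTILINE):
    """Return the expected subdirectories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("./datasets")
    print("missing:", missing if missing else "none")
```

The same function could be pointed at the single-line dataset by passing a different `expected` list.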

Single-line Training

To create our TextFlux-beta weights optimized for the single-line task, we fine-tuned our pre-trained multi-line models. Specifically, we loaded the weights from the TextFlux and TextFlux-LoRA models and continued training for an additional 10,000 steps on a single-line dataset.

If you wish to replicate this process, you can follow these steps:

Prepare the Dataset: First, download the Single-line dataset and arrange it as follows:

|- ./datasets
   |- anyword
   |  |- ReCTS
   |  |- TotalText
   |  |- ArT
   |  ...
   ....

Run the Fine-tuning Script: Ensure your script is configured to load the weights from a pre-trained multi-line model, and then execute the fine-tuning command.
```
# For standard fine-tuning
bash scripts/train.sh
```
or
```
# For LoRA fine-tuning
bash scripts/train_lora.sh
```

Evaluation

First, use the scripts/batch_eval.sh script to perform batch inference on the images in the test set.

bash scripts/batch_eval.sh

Once inference is complete, use eval/eval_ocr.sh to evaluate the OCR accuracy and eval/eval_fid_lpips.sh to evaluate FID and LPIPS scores.

bash eval/eval_ocr.sh

bash eval/eval_fid_lpips.sh
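
For readers unfamiliar with the OCR metrics, the sketch below shows how sequence accuracy and normalized edit distance (NED) are commonly defined: SeqAcc is the share of predictions that match the target string exactly, and NED here is one minus the Levenshtein distance divided by the longer string's length. This is an assumption about the usual definitions, not a copy of eval/eval_ocr.sh, which also runs the OCR recognizer and may normalize text differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ned_similarity(pred, target):
    """1 - normalized edit distance; 1.0 means the strings are identical."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, target) / longest


def score(preds, targets):
    """Return (SeqAcc, NED) as percentages averaged over all pairs."""
    pairs = list(zip(preds, targets))
    seq_acc = sum(p == t for p, t in pairs) / len(pairs)
    ned = sum(ned_similarity(p, t) for p, t in pairs) / len(pairs)
    return 100 * seq_acc, 100 * ned


# Hypothetical OCR outputs versus ground-truth strings.
seq_acc, ned = score(["OPEN", "EX1T", "CAFE"], ["OPEN", "EXIT", "CAFE"])
print(f"SeqAcc {seq_acc:.1f}%  NED {ned:.1f}%")
```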

TODO

Release the training datasets and testing datasets
Release the training scripts
Release the eval scripts
Support comfyui

Acknowledgement

Our code is adapted from Diffusers, and we adopt FLUX.1-Fill-dev as the base model. Thanks to all the contributors for the helpful discussions! We also sincerely thank the contributors of the following code repositories for their valuable contributions: AnyText, AMO.

Citation

@misc{xie2025textfluxocrfreeditmodel,
      title={TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis}, 
      author={Yu Xie and Jielei Zhang and Pengyu Chen and Ziyue Wang and Weihang Wang and Longwen Gao and Peiyi Li and Huyang Sun and Qiang Zhang and Qian Qiao and Jiaqing Fan and Zhouhui Lian},
      year={2025},
      eprint={2505.17778},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17778}, 
}
