---
library_name: transformers
base_model:
- nllg/detikzify-v2-8b
---
# Model Card for DeTi*k*Zifyv2.5 (8b)
DeTi*k*Zifyv2.5 (8b) is a multimodal language model that
automatically converts sketches and existing scientific figures into editable,
semantics-preserving Ti*k*Z graphics programs. It builds on
[DeTi*k*Zifyv2](https://huggingface.co/nllg/detikzify-v2-8b),
post-trained with reinforcement learning and self-computed rewards. This
approach, which we call reinforcement learning from self-feedback (RLSF),
allows the model to improve itself considerably without requiring external
reward functions. Check out the
[DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more
information and tips on how to best run the model.
## Usage
```python
from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2.5-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
```
## Reinforcement Learning from Self-Feedback
### Background
DeTi*k*Zify employs an iterative inference algorithm based on [Monte Carlo Tree
Search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) (MCTS), enabling
it to continuously refine its outputs without additional training. The reward
scores required by MCTS are computed entirely with DeTi*k*Zify's own vision
encoder, which visually assesses the similarity between the input figure and
the compiled output. External reward models are not only unnecessary but often
correlate less well with human judgments, because the vision encoder was
fine-tuned end-to-end with the rest of the model and is therefore well suited
to evaluating this specific task. We refer readers to the
[DeTi*k*Zify](https://arxiv.org/abs/2405.15306) and
[Ti*k*Zero](https://arxiv.org/abs/2503.11509) papers for further details.
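As an illustration of this idea, the sketch below scores the visual similarity
between an input figure and a rendered output by comparing pooled
vision-encoder embeddings. It uses an off-the-shelf SigLIP checkpoint as a
stand-in; DeTi*k*Zify's actual reward is computed with its own fine-tuned
vision encoder and a more elaborate similarity measure, so treat this as a
simplified approximation rather than the exact implementation.
```python
# Illustrative only: score visual similarity between an input figure and a rendered
# output with a generic vision encoder (not DeTikZify's fine-tuned encoder or exact reward).
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

CHECKPOINT = "google/siglip-so400m-patch14-384"  # stand-in encoder for this sketch
processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT).vision_model.eval()

@torch.no_grad()
def self_similarity(reference: Image.Image, rendered: Image.Image) -> float:
    """Cosine similarity between pooled vision-encoder embeddings of two images."""
    inputs = processor(images=[reference, rendered], return_tensors="pt")
    embeddings = encoder(**inputs).pooler_output           # shape: (2, hidden_size)
    ref_embed, out_embed = torch.nn.functional.normalize(embeddings, dim=-1)
    return float(ref_embed @ out_embed)                     # higher = more visually similar
```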
These self-computed rewards have been effective in enhancing model outputs
during *inference*. With reinforcement learning algorithms like [Group
Relative Policy Optimization](https://arxiv.org/abs/2402.03300) (GRPO), this
reward signal can also be used to let the model improve itself in a dedicated
*post-training* step, an approach we call reinforcement learning from self-feedback (RLSF).
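Concretely, GRPO converts these per-output rewards into a learning signal by
comparing each sampled program against the other samples drawn for the same
input figure. For a group of G sampled outputs with self-computed rewards
r_1, …, r_G, each output's advantage is its reward standardized within the
group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

Outputs that render more faithfully than their siblings are thus reinforced,
while worse or non-compiling ones are penalized, all without an external
reward model.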
### Model Training
Our post-training setup only requires figures and, unlike supervised
fine-tuning, does not require aligned code, granting us more flexibility in
selecting training data. 50% of the training data comes from the subset of
[DaTi*k*Zv3](https://huggingface.co/datasets/nllg/datikz-v3) that was
filtered out during the training of DeTi*k*Zifyv2. The remaining 50%
is sampled from the [SPIQA](https://huggingface.co/datasets/google/spiqa)
dataset, which contains image labels for figures extracted from
[arXiv](https://arxiv.org). We exclude all figures from papers included in
DaTi*k*Zv3. We sample this split so that 60% of these figures are
labeled as schematics, 20% as plots, and 20% come from other categories. Since
these figures were not necessarily created with Ti*k*Z, they may help improve
the model's generalization capabilities. As with
DeTi*k*Zifyv2, input figures are randomly converted into synthetic
sketches using image transformations and
[UltraSketch](https://huggingface.co/nllg/ultrasketch).
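The following sketch shows how such a mixture could be assembled; the dataset
records, the `category` field, and the helper function are hypothetical
placeholders rather than the actual preprocessing code.
```python
# Hypothetical sketch of the 50/50 data mixture described above; the records and the
# "category" field are placeholders, not the actual DaTikZ/SPIQA preprocessing.
import random

def build_mixture(datikz_filtered, spiqa_figures, size, seed=0):
    """50% figures filtered out of DaTikZ-v3, 50% SPIQA figures split 60/20/20 by category."""
    rng = random.Random(seed)
    half = size // 2
    mixture = rng.sample(datikz_filtered, half)
    by_category = {"schematic": [], "plot": [], "other": []}
    for figure in spiqa_figures:  # assumes figures from papers in DaTikZ were already removed
        by_category.get(figure["category"], by_category["other"]).append(figure)
    for category, share in [("schematic", 0.6), ("plot", 0.2), ("other", 0.2)]:
        mixture += rng.sample(by_category[category], round(half * share))
    rng.shuffle(mixture)
    return mixture
```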
Using this dataset, we post-train
[DeTi*k*Zifyv2](https://huggingface.co/nllg/detikzify-v2-8b) with
RLSF, employing a batch size of 16. For each image, 32 outputs are generated,
resulting in the model being trained on 512 outputs per step. We train for a
total of 500 steps, which takes 5 days to complete on eight Nvidia H200 GPUs. We
keep the vision encoder frozen to mitigate reward hacking.
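To make the setup concrete, the sketch below outlines the structure of one such
RLSF step at the scale described above (16 input figures with 32 sampled
programs each, i.e., 512 outputs per update). The helper functions and the zero
reward assigned to non-compiling programs are illustrative assumptions, not the
actual training implementation.
```python
# Schematic RLSF step (hypothetical helpers; not the actual training code).
BATCH_SIZE = 16   # input figures per optimizer step
GROUP_SIZE = 32   # sampled TikZ programs per figure -> 16 * 32 = 512 outputs per step

def rlsf_step(policy, images, sample_program, compile_and_render, self_similarity):
    programs, advantages = [], []
    for image in images:                                    # len(images) == BATCH_SIZE
        group = [sample_program(policy, image) for _ in range(GROUP_SIZE)]
        # self-computed reward: visual similarity of the rendered output to the input
        # figure via the frozen vision encoder; non-compiling programs get zero reward here
        rewards = []
        for program in group:
            rendered = compile_and_render(program)
            rewards.append(self_similarity(image, rendered) if rendered is not None else 0.0)
        # group-relative advantages, as in the GRPO formula above
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0  # avoid /0
        programs += group
        advantages += [(r - mean) / std for r in rewards]
    policy.update(programs, advantages)                     # one policy-gradient update
```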
### Experiments and Results
We evaluate DeTi*k*Zifyv2.5 (8b) on the test split of
DaTi*k*Zv3 and compare it to DeTi*k*Zifyv2 (8b). The
metrics employed include DreamSim (DSim), Kernel Inception Distance (KID),
CrystalBLEU (cBLEU), TeX Edit Distance (TED), Mean Token Efficiency (MTE), and
Mean Sampling Throughput (MST). Refer to the
[DeTi*k*Zify](https://arxiv.org/abs/2405.15306) paper for further details. All
scores except MST are multiplied by 100.
#### Sampling-based Inference
**Reference Figures**

| Model | DSim↑ | KID↓ | cBLEU↑ | TED↓ | MTE↑ |
|---|---|---|---|---|---|
| DeTi*k*Zifyv2 (8b) | 80.503 | 0.626 | 6.105 | 54.946 | 93.326 |
| DeTi*k*Zifyv2.5 (8b) | 84.6438 | 0.298 | 4.202 | 52.939 | 100 |

**Synthetic Sketches**

| Model | DSim↑ | KID↓ | cBLEU↑ | TED↓ | MTE↑ |
|---|---|---|---|---|---|
| DeTi*k*Zifyv2 (8b) | 74.584 | 0.751 | 3.356 | 58.32 | 93.858 |
| DeTi*k*Zifyv2.5 (8b) | 78.257 | 0.577 | 1.551 | 56.121 | 100 |
In sampling-based inference (i.e., accepting the first output that compiles
successfully) using reference figures and synthetic sketch inputs,
DeTi*k*Zifyv2.5 (8b) outperforms DeTi*k*Zifyv2 (8b) on
most metrics, demonstrating that RLSF can effectively enhance performance. The
considerably increased DreamSim scores indicate that DeTi*k*Zifyv2.5
(8b) generates outputs that are much more visually similar to the reference
figures. Furthermore, it is much less likely to produce outputs that do not
compile, as evidenced by its perfect MTE score. Interestingly, while it scores
lower on the code-based metric CrystalBLEU, it performs better on the
code-based TED. DeTi*k*Zifyv2.5 (8b) tends to generate more concise
programs with less syntactic noise. While this likely reduces the n-gram
overlap with the reference code, it also decreases the number of edits
necessary to convert one into the other, which explains this phenomenon. Generally,
more concise programs are beneficial as long as the semantics are preserved.
#### MCTS-based Inference
**Reference Figures**

| Model | DSim↑ | KID↓ | cBLEU↑ | TED↓ | MST↑ |
|---|---|---|---|---|---|
| DeTi*k*Zifyv2 (8b) | 89.020 | 0.016 | 6.593 | 52.466 | 52.723 |
| DeTi*k*Zifyv2.5 (8b) | 90.889 | -0.047 | 4.646 | 51.824 | 68.12 |

**Synthetic Sketches**

| Model | DSim↑ | KID↓ | cBLEU↑ | TED↓ | MST↑ |
|---|---|---|---|---|---|
| DeTi*k*Zifyv2 (8b) | 81.482 | 0.313 | 3.344 | 56.405 | 53.586 |
| DeTi*k*Zifyv2.5 (8b) | 83.74 | 0.61 | 1.976 | 55.239 | 78.908 |
We observe similar trends when using our MCTS-based inference algorithm with a
time budget of 10 minutes. Compared to sampling-based inference,
DeTi*k*Zifyv2.5 (8b) noticeably improves its scores, illustrating
that MCTS on top of RLSF can still lead to additional gains. Additionally,
within the same timeframe, DeTi*k*Zifyv2.5 (8b) generates 25 more
outputs than DeTi*k*Zifyv2 (8b), supporting our hypothesis that the
generated programs are more concise. On reference figures,
DeTi*k*Zifyv2.5 (8b) scores better on both DreamSim and KID, with
the KID score even being slightly negative due to the high similarity of
distributions. For synthetic sketches, it achieves a higher DreamSim score but
performs worse on KID, indicating that it prioritizes faithfulness to the
reference figure over general aesthetics.
#### Inference with Ti*k*Zero Adapters
**Captions**

| Model | DSim↑ | KID↓ | CLIP↑ | cBLEU↑ | TED↓ | MTE↑ |
|---|---|---|---|---|---|---|
| DeTi*k*Zifyv2 (8b) | 52.829 | 5.103 | 10.051 | 1.603 | 65.51 | 82.291 |
| DeTi*k*Zifyv2.5 (8b) | 53.564 | 7.471 | 7.968 | 0.732 | 62.189 | 100 |
[Ti*k*Zero adapters](https://huggingface.co/nllg/tikzero-adapter) integrate
into the vision encoder of DeTi*k*Zify models, enabling them to be conditioned
on text in addition to images. Since we keep the vision encoder frozen, we can
evaluate DeTi*k*Zifyv2.5 (8b) on adapters trained for
DeTi*k*Zifyv2 (8b). Compared to our previous experiments, the
results are more varied. While DeTi*k*Zifyv2.5 (8b) achieves a
better DreamSim value and maintains a perfect MTE, it performs worse on
CLIPScore, suggesting difficulties in reproducing text from captions. This
could be due to an increased modality gap, as RLSF further refines the model
for image-only inputs. We plan to address this in future work by incorporating
caption inputs into RLSF training.
### Summary
Overall, RLSF greatly enhances model performance for most tasks. For image
and sketch inputs, DeTi*k*Zifyv2.5 (8b) emerges as the clear leader.
For text inputs via Ti*k*Zero adapters, the choice between model versions
depends on the specific use case, given the trade-offs involved.
## Acknowledgments
This model was trained using computational resources provided by the
bwForCluster Helix, as part of the bwHPC-S5 project. The authors acknowledge
support from the state of Baden-Württemberg through the bwHPC initiative and
the German Research Foundation (DFG) under grant INST 35/1597-1 FUGG. This
project was inspired by the paper [Rendering-Aware Reinforcement Learning for
Vector Graphics Generation](https://arxiv.org/abs/2505.20793).