M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding Paper • 2411.04952 • Published Nov 7, 2024 • 30
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models Paper • 2411.05005 • Published Nov 7, 2024 • 13
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models Paper • 2411.04075 • Published Nov 6, 2024 • 17
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems Paper • 2411.02959 • Published Nov 5, 2024 • 71
The Lessons of Developing Process Reward Models in Mathematical Reasoning Paper • 2501.07301 • Published Jan 13 • 99
Evaluating Sample Utility for Data Selection by Mimicking Model Weights Paper • 2501.06708 • Published Jan 12 • 5
MiniMax-01: Scaling Foundation Models with Lightning Attention Paper • 2501.08313 • Published Jan 14 • 286
3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering Paper • 2501.05131 • Published Jan 9 • 37
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Paper • 2501.08326 • Published Jan 14 • 35
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them Paper • 2501.08292 • Published Jan 14 • 17
FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation Paper • 2502.01068 • Published Feb 3 • 16
Reward-Guided Speculative Decoding for Efficient LLM Reasoning Paper • 2501.19324 • Published Jan 31 • 39
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation Paper • 2501.17433 • Published Jan 29 • 9
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction Paper • 2504.01014 • Published 23 days ago • 64
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step Paper • 2504.01956 • Published 22 days ago • 40
Towards Physically Plausible Video Generation via VLM Planning Paper • 2503.23368 • Published 26 days ago • 39
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published 14 days ago • 46
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Paper • 2504.08736 • Published 13 days ago • 47
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs Paper • 2504.11536 • Published 9 days ago • 58
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference Paper • 2504.10326 • Published 10 days ago • 25
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers Paper • 2504.10483 • Published 10 days ago • 20
Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution Paper • 2504.09566 • Published 12 days ago • 10
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Paper • 2504.13161 • Published 7 days ago • 86
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Paper • 2504.13169 • Published 7 days ago • 38
InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework Paper • 2504.12395 • Published 8 days ago • 17
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float Paper • 2504.11651 • Published 9 days ago • 15
Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark Paper • 2504.13143 • Published 7 days ago • 8
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization Paper • 2503.19901 • Published about 1 month ago • 39
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals Paper • 2503.19953 • Published about 1 month ago • 3