LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
Abstract
LAMIC, a Layout-Aware Multi-Image Composition framework, extends single-reference diffusion models to multi-reference scenarios using attention mechanisms, achieving state-of-the-art performance in controllable image synthesis without training.
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control, and Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG score on complex composition tasks. These results demonstrate LAMIC's superior identity preservation, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and extending them seamlessly to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
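The abstract does not spell out how GIA and RMA are implemented; one plausible way to picture them is as boolean attention masks over the MMDiT joint token sequence. The sketch below is a minimal illustration under that assumption, not LAMIC's actual code: it assumes GIA blocks attention between tokens of different reference-image groups (while text and latent tokens stay globally visible), and RMA lets each reference group exchange attention only with the latent tokens inside its assigned layout region. All names (`build_lamic_style_mask`, `latent_region_ids`, the sequence ordering) are hypothetical.

```python
import torch

def build_lamic_style_mask(text_len, ref_group_lens, latent_region_ids, num_groups):
    """Illustrative attention mask combining GIA- and RMA-style constraints.

    Assumed sequence layout: [text | ref group 0 | ... | ref group G-1 | latents].
    latent_region_ids: (num_latents,) long tensor; entry i is g if latent token i
    lies inside group g's layout box, else -1 (background).
    Returns a (total, total) boolean mask where True = attention allowed.
    """
    ref_len = sum(ref_group_lens)
    lat_len = latent_region_ids.numel()
    total = text_len + ref_len + lat_len

    # Group id per token: -2 marks "global" tokens (text and latents),
    # g marks tokens belonging to reference group g.
    group_of = torch.full((total,), -2, dtype=torch.long)
    start = text_len
    for g, n in enumerate(ref_group_lens):
        group_of[start:start + n] = g
        start += n

    mask = torch.ones(total, total, dtype=torch.bool)

    # GIA-style isolation: tokens of different reference groups never attend
    # to each other, so each entity's features stay disentangled.
    for i in range(num_groups):
        for j in range(num_groups):
            if i != j:
                rows = group_of == i
                cols = group_of == j
                mask[rows.unsqueeze(1) & cols.unsqueeze(0)] = False

    # RMA-style modulation: a latent token exchanges attention only with the
    # reference group whose layout region contains it.
    lat_slice = slice(text_len + ref_len, total)
    for g in range(num_groups):
        outside = latent_region_ids != g          # latents outside group g's box
        ref_rows = group_of == g
        lat_cols = torch.zeros(total, dtype=torch.bool)
        lat_cols[lat_slice] = outside
        mask[ref_rows.unsqueeze(1) & lat_cols.unsqueeze(0)] = False  # ref -> latent
        mask[lat_cols.unsqueeze(1) & ref_rows.unsqueeze(0)] = False  # latent -> ref

    return mask  # usable as a boolean attn_mask in scaled_dot_product_attention
```

Because such a mask only constrains where attention flows, it can in principle be dropped into a pretrained joint-attention model without touching its weights, which is consistent with the training-free, plug-and-play framing above.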
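The abstract likewise gives no formulas for the proposed metrics. One plausible reading, shown below purely for illustration, is that IN-R measures how much of a generated entity falls inside its assigned layout box, and FI-R how much of that box the entity fills. The function name and the mask/box representation are assumptions, not the paper's definitions.

```python
import numpy as np

def inclusion_and_fill_ratio(entity_mask, box):
    """Hypothetical formulations of IN-R and FI-R for a single entity.

    entity_mask: (H, W) boolean array marking where the entity was generated
    box: (x0, y0, x1, y1) target layout region in pixel coordinates
    """
    x0, y0, x1, y1 = box
    region = np.zeros_like(entity_mask, dtype=bool)
    region[y0:y1, x0:x1] = True

    entity_area = entity_mask.sum()
    overlap = (entity_mask & region).sum()

    in_r = overlap / max(entity_area, 1)   # fraction of the entity inside its box
    fi_r = overlap / max(region.sum(), 1)  # fraction of the box the entity fills
    return in_r, fi_r
```

Under this reading, a high IN-R with a low FI-R would indicate an entity correctly placed but too small for its box, which is why the two ratios are complementary measures of layout control.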
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- IC-Custom: Diverse Image Customization via In-Context Learning (2025)
- PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement (2025)
- ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation (2025)
- CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step (2025)
- PositionIC: Unified Position and Identity Consistency for Image Customization (2025)
- FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers (2025)
- Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation (2025)