LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Yuyao ZHANG
Dartmouth College
yuyao.zhang.gr@dartmouth.edu
Jinghao LI
CUHK
jhli4@cse.cuhk.edu.hk
Yu-Wing TAI
Dartmouth College
yu-wing.tai@dartmouth.edu
Abstract

Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present LayerCraft, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) structured generation from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) layered object integration, allowing users to insert and customize objects—such as characters or props—across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the ChainArchitect for CoT-driven layout planning, and the Object Integration Network (OIN) for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.

Figure 1: Application demonstrations for LayerCraft. Left: Demonstrates batch collage editing capabilities. A user uploads graduation photos and LayerCraft seamlessly integrates a graduation bear across all images. The system first generates a reference bear for consistency, then analyzes optimal placement while preserving facial identity and background integrity. Right: Illustrates the structured text-to-image generation process. From a simple "Alice in Wonderland" prompt, LayerCraft employs chain-of-thought reasoning to sequentially generate background elements, determine object layout, and compose the final image. The framework supports post-generation customization, as shown with the lion integration.

1 Introduction

Text-to-image (T2I) generation has rapidly evolved with advances in diffusion models ho2020denoising ; Rombach_2022_CVPR ; podell2023sdxl , transformer-based architectures vaswani2017attention , and scalable encoder-decoder frameworks ronneberger2015u . Recent systems esser2024scaling ; videoworldsimulators2024 ; chen2024pixart produce visually impressive results from simple prompts. However, they still fall short in offering precise, intuitive control over spatial composition, multi-object interactions, and iterative customization.

Existing approaches to fine-grained T2I control often require architectural modifications or fine-tuning zhang2023adding ; ye2023ip ; ruiz2023dreambooth , which limits generality and usability. Others support instance-level manipulation wang2024instancediffusion ; xie2023boxdiff ; kim2023dense , but often falter in complex scenes or suffer from spatial inconsistency. More structured methods like LayoutGPT feng2023layoutgpt and GenArtist wang2025genartist attempt procedural generation, but neglect 3D spatial reasoning or rely on inefficient pipelines with excessive external tools. Even advanced multi-modal agents like GPT-4o (https://openai.com/index/gpt-4o-system-card/) fail to maintain background consistency or facial identity over multiple editing iterations.

LayerCraft is our answer to these limitations: a fully automatic, modular framework for structured T2I generation and editing, designed to balance expressive control, compositional accuracy, and system efficiency. LayerCraft treats image synthesis as a step-by-step reasoning process, orchestrated by a team of specialized agents that handle prompt interpretation, spatial planning, and object integration. As shown in Figure 1, our framework supports applications such as batch collage editing with consistent object insertion, and narrative-driven image generation using structured reasoning and layout planning.

  • LayerCraft Coordinator serves as the central interface, managing interactions between users and agents. It processes instructions, coordinates agent outputs, and integrates user feedback throughout the generation process.

  • ChainArchitect performs chain-of-thought (CoT) reasoning to decompose prompts into structured layout plans. It first generates the background, then infers a spatial layout, represented as a dependency-aware 3D scene graph, to determine bounding boxes and relationships among objects. This planning phase supports complex multi-object scenes and facilitates layer-wise, editable image construction.

  • Object Integration Network (OIN) uses the original FLUX flux2024 T2I model to seamlessly inpaint objects into specific regions. By applying dual LoRA adapters, OIN integrates both background and reference conditions while preserving generative quality. Its attention-mixing mechanism ensures that inserted objects align contextually and stylistically with the base image.

LayerCraft introduces several advantages over prior work: (1) it eliminates the need for model fine-tuning or external tools, making it accessible and lightweight; (2) it offers interpretable, spatially aware image construction via CoT-guided layout planning; and (3) it supports consistent object editing across single or multiple images without sacrificing visual quality. Compared to LayoutGPT feng2023layoutgpt and GenArtist wang2025genartist , which struggle with spatial coherence and integration complexity, LayerCraft provides a unified, agent-based framework capable of general-purpose generation and editing.

Our experiments demonstrate that LayerCraft excels in various creative workflows, from narrative scene composition to iterative and batch image editing, empowering both experts and non-experts to produce controllable, high-quality images with minimal effort.

2 Related Work

Controllable Image Generation. Text-to-image (T2I) generation has seen rapid progress, led by advances in diffusion models—from pixel-space methods like GLIDE nichol2021glide and Imagen saharia2022photorealistic to more efficient latent-space frameworks such as Stable Diffusion Rombach_2022_CVPR and Raphael xue2024raphael . Enhancements in multimodal alignment (e.g., DALLE-2 ramesh2022hierarchical , Playground li2024playground ) and architectural designs (e.g., Diffusion Transformers peebles2023scalable , PixArt chen2024pixart , FLUX flux2024 ) have substantially improved the quality and diversity of generated content. However, fine-grained and interpretable control remains challenging, especially in scenes with multiple objects or complex layouts. Personalization methods like DreamBooth ruiz2023dreambooth and Textual Inversion gal2022image support user-specific concepts but require task-specific fine-tuning. Structured control approaches such as ControlNet zhang2023adding and GLIGEN li2023gligen offer spatial conditioning via edge maps or boxes but rely on detailed inputs and lack high-level scene reasoning.

Recent techniques like Raphael xue2024raphael improve specialization through expert models at the cost of high computation. Lightweight alternatives like Attend chefer2023attend reduce overhead but struggle with compositional complexity. Autoregressive frameworks (e.g., LlamaGen sun2024autoregressive , Show-O xie2024showo , Janus-Pro chen2025janus ) explore prompt-based synthesis via language models but often lack spatial structure. Meanwhile, emerging MLLMs such as GPT-4o and Gemini 2.0 Flash (https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.0-flash-001?inv=1&invt=AbxbDg) show generative promise, though they require substantial resources and offer limited layout control.

In contrast, LayerCraft introduces a modular, agent-based framework for structured multi-object generation and editing with minimal user input. The ChainArchitect employs chain-of-thought (CoT) reasoning to produce interpretable, 3D-aware layouts, enabling precise spatial planning without manual annotations. The Object Integration Network (OIN) leverages dual-LoRA fine-tuning on a pre-trained diffusion model (FLUX) to enable parameter-efficient object integration with strong visual fidelity and contextual coherence. Unlike methods such as OminiControl tan2024ominicontrol , which emphasize attention-based spatially aligned and subject-driven generation, LayerCraft supports broader workflows, including image-guided inpainting, iterative editing, and batch collage generation, within a unified, lightweight pipeline that generalizes effectively across diverse scenarios.

Agent-Based Generation. The rise of large language models (LLMs) has greatly advanced zero- and few-shot learning across diverse domains achiam2023gpt ; team2023gemini . With multimodal training alayrac2022flamingo ; liu2023visual ; zhu2023minigpt , LLMs have evolved into powerful agents for reasoning and creative generation yang2024worldgpt ; wu2024motion . Among these, LayoutGPT feng2023layoutgpt uses LLMs to generate spatial layouts from text prompts. While effective for simple scenes, its reliance on static layout models limits its ability to handle complex prompts and spatial relationships, particularly due to the absence of multi-step reasoning. Other frameworks like GenArtist wang2025genartist and LLM Blueprints gani2024llm follow a “generate-then-edit” paradigm, refining initial layouts or images through external editing modules. This often leads to stylistic drift and unstable outputs, due to fragmented control and lack of shared context across steps.

In contrast, LayerCraft offers an integrated multi-agent framework that unifies layout planning, refinement, and iterative object integration. The LayerCraft Coordinator orchestrates agent interactions and incorporates user feedback throughout the process. The ChainArchitect improves upon LayoutGPT by applying chain-of-thought (CoT) reasoning to generate structured, 3D-aware layouts, enabling compositional planning without external layout tools. The Object Integration Network (OIN) complements this with image-guided inpainting via dual-LoRA fine-tuning on a pre-trained model, supporting adaptive, context-aware generation while maintaining high visual fidelity. Unlike modular pipelines that rely on third-party components or model modifications, LayerCraft remains self-contained and parameter-efficient, offering a robust and consistent user experience.

Figure 2: LayerCraft is a framework with three key components: the LayerCraft Coordinator, which processes user instructions and manages collaboration; ChainArchitect, which enhances prompts to plan layouts, identify objects and relationships, and assign bounding boxes using Chain-of-Thought reasoning; and the Object Integration Network (OIN), which enables image-guided inpainting for seamless object integration using the LoRA fine-tuned FLUX model.

Chain of Thought Reasoning. Chain-of-thought (CoT) prompting has proven effective in improving language model reasoning by decomposing complex tasks into intermediate steps wei2022chain ; zhang2022automaticchainthoughtprompting . However, in multimodal settings, existing CoT approaches often rely on model finetuning over specialized datasets mondal2024kamcot ; zhang2024multimodalchainofthoughtreasoninglanguage , limiting their applicability in zero-shot or flexible generation scenarios.

LayerCraft takes a different approach by incorporating CoT reasoning without requiring additional fine-tuning. The LayerCraft Coordinator uses CoT to iteratively revise and enrich user prompts, while the ChainArchitect applies CoT-style decomposition to translate high-level instructions into structured, 3D-aware layouts. This allows LayerCraft to reason over complex spatial relationships and multi-object configurations in a fully zero-shot, training-free setting. By leveraging CoT within a modular agent framework, LayerCraft achieves interpretable, step-wise control in multimodal image generation, offering a robust alternative to methods that depend on task-specific finetuning or static layout templates.

3 Methodology

This section details the design of LayerCraft, overviewed in Figure 2. Leveraging GPT-4o as the central coordinator, LayerCraft enables self-monitoring, user-agent interaction, aesthetically refined outputs, and multi-turn editing. The framework consists of three main agents: (1) LayerCraft Coordinator, which processes user instructions and orchestrates agent collaboration; (2) ChainArchitect, a layout planning agent that generates backgrounds and assigns objects and their spatial relationships; and (3) Object Integration Network (OIN), which integrates objects seamlessly into the background based on a given mask.

3.1 LayerCraft Coordinator

The LayerCraft Coordinator acts as the central orchestrator for the entire framework, overseeing the system’s operation, ensuring smooth user-agent interactions, and directing agent collaboration. This component also serves as the primary interface for user input, streamlining communication between the user and the system.

Agent-Agent Interaction. The framework integrates multiple specialized agents, each responsible for a specific task such as content recognition, reference image generation, layout planning (ChainArchitect), and final image generation and inpainting (OIN). The Coordinator plays a crucial role in orchestrating these agents, breaking down tasks, assigning responsibilities, and ensuring effective communication between them. Since generative models can produce intermediate outputs with inherent randomness, the Coordinator rigorously checks the consistency of both textual and visual outputs. If discrepancies are detected, it formulates corrective measures and delegates the task to the appropriate agent for regeneration. This enables LayerCraft to ensure that the final output meets the user’s specifications.
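The check-and-regenerate behavior described above can be sketched as a small retry loop. This is a minimal illustration under our own assumptions, not the authors' implementation: `run_with_checks`, `CheckResult`, the `agent` and `check` callables, and the `max_retries` limit are all hypothetical names introduced here, and the real Coordinator delegates both generation and checking to GPT-4o rather than to plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Outcome of a consistency check (hypothetical structure)."""
    ok: bool
    feedback: str = ""  # corrective instruction used when ok is False


def run_with_checks(
    task: str,
    agent: Callable[[str], str],
    check: Callable[[str, str], CheckResult],
    max_retries: int = 3,
) -> Optional[str]:
    """Call `agent` until `check` accepts its output or retries run out."""
    instruction = task
    for _ in range(max_retries):
        output = agent(instruction)
        result = check(task, output)
        if result.ok:
            return output
        # Fold the corrective feedback into the next request to the agent.
        instruction = f"{task}\nCorrection: {result.feedback}"
    return None  # the caller decides how to handle a persistent failure
```

Bounding the number of retries is a design choice of this sketch; the paper does not state how many regeneration rounds the Coordinator allows.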

User-Agent Interaction. Although the system operates autonomously, users can modify or refine the output by interacting with the agents. For example, a user may request more details for a specific object or a customized layout for a particular region. The Coordinator facilitates multiple rounds of interaction, refining the image iteratively until the user’s requirements are fully met.

Chain-of-Thought (CoT) Enrichment. The Coordinator enhances generation by iteratively enriching the text prompt using a Chain-of-Thought (CoT) approach. Starting from the user’s input, it “asks itself” which objects should appear and how they should be arranged to meet user intent (see Figure 1, right). This reasoning produces detailed descriptions of background and foreground elements, filling gaps in the original prompt.

If the Coordinator determines the user’s prompt is already sufficiently detailed, it skips CoT reasoning and proceeds directly to task delegation. This adaptive strategy improves efficiency by avoiding unnecessary steps when the input is complete.

3.2 ChainArchitect

ChainArchitect advances traditional layout generation models (e.g., LayoutGPT feng2023layoutgpt ) by integrating Chain-of-Thought (CoT) reasoning to better handle complex prompts involving multiple objects and intricate spatial relationships.

Given a user input prompt $P_i$, which may range from detailed to brief, the LLM identifies relevant objects and generates a structured list $O = \{O_i \mid i \in \mathbb{N}\}$ alongside a background description $P_{b_i}$. For instance, if the prompt mentions a “car,” ChainArchitect infers a suitable context such as a “road.” The background description $P_{b_i}$ is passed to the FLUX model by the Coordinator to generate the background image $I_{bg}$, which serves as a spatial reference for placing foreground objects.

To ensure the generated layout follows a consistent, interpretable format (e.g., JSON), ChainArchitect uses in-context exemplars (see supplementary materials) that define object classes, spatial positions, and scene style, thereby aligning the output with the user’s intent. Additionally, ChainArchitect leverages GPT-4o’s vision capabilities to analyze the background image viewpoint, improving object placement accuracy.

For foreground objects, ChainArchitect performs explicit spatial reasoning: it determines an optimal generation order (placing distant objects before closer ones to manage occlusion) and models inter-object relationships, such as relative positioning (“A is on top of B”) and orientation (“Person A is facing left”). This structured reasoning enables coherent and realistic multi-object layouts even in complex scenes.
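To make the layout format and the far-to-near generation order concrete, the following sketch parses a hypothetical JSON layout and sorts its objects for generation. The field names (`bbox`, `depth`, `relations`), the normalized box coordinates, and the use of a single scalar depth are illustrative assumptions; the actual exemplar schema is given only in the paper's supplementary materials, and the paper describes a dependency-aware 3D scene graph rather than a simple depth sort.

```python
import json

# Hypothetical layout in the spirit of the JSON format described above.
LAYOUT_JSON = """
{
  "background": "a quiet country road at dusk",
  "objects": [
    {"name": "car",  "bbox": [0.30, 0.55, 0.70, 0.90], "depth": 2.0,
     "relations": ["on the road"]},
    {"name": "tree", "bbox": [0.05, 0.10, 0.30, 0.70], "depth": 8.0,
     "relations": ["left of car"]},
    {"name": "dog",  "bbox": [0.60, 0.75, 0.80, 0.98], "depth": 1.0,
     "relations": ["in front of car"]}
  ]
}
"""


def validate_bbox(bbox):
    """Check a normalized [x0, y0, x1, y1] box with x0 < x1 and y0 < y1."""
    x0, y0, x1, y1 = bbox
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0


def generation_order(layout):
    """Return object names sorted far-to-near, so that closer objects are
    generated later and can occlude more distant ones."""
    objects = layout["objects"]
    for obj in objects:
        if not validate_bbox(obj["bbox"]):
            raise ValueError(f"invalid bbox for {obj['name']}: {obj['bbox']}")
    ordered = sorted(objects, key=lambda o: o["depth"], reverse=True)
    return [o["name"] for o in ordered]


layout = json.loads(LAYOUT_JSON)
order = generation_order(layout)
print(order)  # ['tree', 'car', 'dog']
```

Validating the boxes before ordering mirrors the idea that a malformed layout should be caught and regenerated before any image is synthesized.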

Figure 3: Architecture of the Object Integration Network (OIN). The system processes a text prompt, a background image with a designated bounding box, and a reference object to produce a seamlessly integrated result. Red, yellow, and blue indicators represent the use of combined LoRA weights, background inpainting weights, and subject-driven generation weights, respectively. “FF” and “MM Attn” denote the feedforward layers and multi-modal attention layers in the FLUX model.

3.3 Object-Integration Network (OIN)

The Object Integration Network (OIN) facilitates the seamless incorporation of objects into pre-existing backgrounds, as illustrated in Figure 3. OIN processes a masked background (delineated by a bounding box), a reference object image, and a text prompt to synthesize a contextually coherent and visually consistent integration of the specified object into the background environment.

A Parameter Reuse Method for Multiple Conditional Generation via Dual LoRA. Leveraging the robust pretrained capabilities of the FLUX model on the text-to-image task, we implement a parameter-efficient adaptation methodology for conditional generation. This approach enables the framework to process masked backgrounds and reference object images for highly precise subject-driven inpainting. Our implementation follows a two-phase training protocol.

In the initial phase, we develop two independent LoRA adaptors, $W_{bg}$ and $W_{obj}$, that enhance the model’s capacity to interpret conditional images for inpainting and subject-driven generation tasks. Following the technique proposed in OminiControl tan2024ominicontrol , we incorporate positional embeddings for background image tokens using encodings identical to the initial noise, while reference image tokens utilize biased embeddings to accommodate spatially aligned and unaligned processing requirements.

The second phase initializes the model with the trained LoRA modules, enabling comprehensive understanding of both background and object conditions ($C_{bg}$ and $C_{obj}$). To circumvent the quadratic memory complexity associated with processing extensive token sequences and to maintain clarity in condition relationships, we bifurcate the latent sequence into two components: $[C_T, X, C_{bg}]$ for background processing and $[C_T, X, C_{obj}]$ for object integration. These components undergo parallel processing with query, key, and value projections utilizing distinct weight sets: $M_{qkv}^{W_{both}}$ for joint conditions, $M_{qkv}$ for FLUX’s foundational weights, $M_{qkv}^{W_{inp}}$ for inpainting-specialized LoRA weights, and $M_{qkv}^{W_{obj}}$ for object-specific LoRA weights. This architecture generates dual query, key, and value outputs: $[Q_1, K_1, V_1]$ for background elements and $[Q_2, K_2, V_2]$ for object features.

The attention mechanism computes outputs through the following formulations:

\[
\begin{aligned}
[C_T^1, X^1, C_{bg}] &= \operatorname{Softmax}\left(\frac{Q_1 K_1^T}{\sqrt{d}}\right) V_1, \\
[C_T^2, X^2, C_{obj}] &= \operatorname{Softmax}\left(\frac{Q_2 K_2^T}{\sqrt{d}}\right) V_2, \\
\text{Output} &= \left[\frac{C_T^1 + C_T^2}{2},\; M(X^1, X^2),\; C_{bg},\; C_{obj}\right],
\end{aligned}
\]

where $M(X^1, X^2)$ denotes the replacement of the masked region’s latent sequence $X^1$ with $X^2$ according to the bounding box mask. This methodology preserves both the generative capabilities of the model and its interpretation of the respective conditions, as the resultant image is generated without LoRA layer activation while conditions are processed using their corresponding LoRA weights. Consequently, the training objective focuses on establishing the relationship between textual input and the conditional elements.
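The three attention equations above can be illustrated with a small NumPy sketch of the two-branch computation and the masked merge. This is a simplified illustration rather than the OIN implementation: it uses single-head attention, feeds each sequence in directly as its own query, key, and value instead of applying the distinct projection weights ($M_{qkv}$ and its LoRA variants), and the token counts, the `mix_branches` helper, and the boolean `box_mask` over image tokens are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def mix_branches(out_bg, out_obj, n_text, n_img, box_mask):
    """Merge the two branch outputs as in the Output equation.

    out_bg:   attention output for [C_T, X, C_bg],  shape (n_text + n_img + n_bg,  d)
    out_obj:  attention output for [C_T, X, C_obj], shape (n_text + n_img + n_obj, d)
    box_mask: boolean array of shape (n_img,), True for image tokens inside the box
    """
    ct1, x1, c_bg = np.split(out_bg, [n_text, n_text + n_img])
    ct2, x2, c_obj = np.split(out_obj, [n_text, n_text + n_img])
    c_t = 0.5 * (ct1 + ct2)                  # average the two text outputs
    x = np.where(box_mask[:, None], x2, x1)  # M(X^1, X^2): X^2 inside the box, X^1 elsewhere
    return np.concatenate([c_t, x, c_bg, c_obj], axis=0)


rng = np.random.default_rng(0)
d, n_text, n_img, n_bg, n_obj = 8, 3, 4, 4, 5
seq_bg = rng.normal(size=(n_text + n_img + n_bg, d))    # stands in for [C_T, X, C_bg]
seq_obj = rng.normal(size=(n_text + n_img + n_obj, d))  # stands in for [C_T, X, C_obj]
out_bg = attention(seq_bg, seq_bg, seq_bg)
out_obj = attention(seq_obj, seq_obj, seq_obj)
box_mask = np.array([False, True, True, False])
mixed = mix_branches(out_bg, out_obj, n_text, n_img, box_mask)
print(mixed.shape)  # (16, 8)
```

Because each branch attends over its own shorter sequence, the attention matrices have roughly `(n_text + n_img + n_cond)^2` entries each instead of a single matrix over all condition tokens at once, which is the memory saving the text refers to.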

Discussion: OIN supports LayerCraft’s use of intermediate reference images. It also keeps the pipeline efficient: during reference image creation we employ FLUX as the primary generator, and when reference images are unnecessary the LayerCraft Coordinator can selectively load only the inpainting LoRA weights. This stands in contrast to frameworks such as GenArtist, which requires more than ten external models, resulting in computational inefficiency due to model loading/unloading cycles and introducing stylistic inconsistencies stemming from distributional differences across models. We provide more details and analysis of OIN in the supplementary materials.

4 Experiments

Figure 4: Visual comparisons with state-of-the-art generic text-to-image generation models are presented. On the left, the prompts are annotated with distinct colors to highlight critical attributes and relationships.

Implementation Details. We use OpenAI’s GPT-4o achiam2023gpt as the base LLM for both the LayerCraft Coordinator and ChainArchitect agent, with the temperature set to 0.1 to balance control and creativity. Our text-to-image backbone is FLUX.1-dev flux2024 , implemented via the Hugging Face Diffusers library diffusers .

The Object Integration Network (OIN) is built using Diffusers and PEFT, and trained with a batch size of 1 and gradient accumulation over 4 steps on 4 NVIDIA A6000 Ada GPUs (48GB each). We use a LoRA rank of 4 and enable gradient checkpointing for memory efficiency. OIN is trained for 20,000 iterations on a 50K subset of IPA300K, while OminiControl is fine-tuned for 50,000 iterations. Additional samples are drawn from the remaining dataset for qualitative evaluation.

Dataset Preparation (IPA300K). To ensure diversity, we use ChatGPT (via O1) to generate a list of 500 unique objects across various categories. For each object, we create 20 descriptive prompts with varying attributes. Following the procedure in tan2024ominicontrol , we generate 10 scene-level descriptions and 1 studio-level description per prompt to facilitate paired image generation using FLUX.1-dev with 4 random seeds. This results in paired images: one with the object in isolation and one within a complex scene. To obtain accurate object localization, we apply Grounding DINO liu2024grounding and SAM 2 ravi2024sam to extract bounding boxes from the scene images. Additional image pairs are generated with smaller object sizes to reflect realistic subject-driven inpainting cases in our framework. Bounding boxes are expanded by 15% at the bottom and 10% on each side to reduce the impact of shadows or reflections. After filtering mismatched pairs using LLM-based validation, we obtain a final dataset of 300,000 high-quality pairs, which we name Image-guided inPainting Assets (IPA300K). The dataset will be released on HuggingFace.
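The bounding-box expansion step can be sketched as follows. The paper does not state what the percentages are relative to, so this sketch assumes they are fractions of the box's own width and height, that coordinates are pixel values with y increasing downward, and that the expanded box is clamped to the image bounds; `expand_bbox` and its parameter names are hypothetical.

```python
def expand_bbox(box, img_w, img_h, side=0.10, bottom=0.15):
    """Expand an (x0, y0, x1, y1) pixel box to cover shadows and reflections.

    The box grows by `side` * width on the left and on the right, and by
    `bottom` * height downward. The top edge is left unchanged. The result
    is clamped so that it stays inside an img_w x img_h image.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    new_x0 = max(0, x0 - side * w)
    new_x1 = min(img_w, x1 + side * w)
    new_y1 = min(img_h, y1 + bottom * h)
    return (new_x0, y0, new_x1, new_y1)


# A 200 x 200 box in a 512 x 512 image grows by about 20 px per side
# and about 30 px at the bottom.
print(expand_bbox((100, 100, 300, 300), 512, 512))
```

Leaving the top edge untouched reflects the stated motivation: shadows and reflections typically fall below or beside an object rather than above it.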

Figure 5: More example usage of LayerCraft. Compared to GPT-4o, our model generates results with a consistent background and object identity. The comparison also illustrates the importance of the pipeline’s design with OIN and intermediate reference images. GenArtist still fails even when provided with ground-truth bounding boxes and extra instructions.

4.1 Visual Comparison with State-of-the-Art Methods

Figure 4 provides a qualitative comparison of our LayerCraft framework against a diverse set of state-of-the-art baselines, including generic diffusion models blackforestlabs2024flux1dev ; betker2023improving ; esser2024scaling , agent-based approaches wang2025genartist ; gani2024llm , autoregressive models xie2024showo ; chen2025janus , and GPT-4o. We crafted prompts that vary in object attributes, quantities, and spatial configurations to rigorously evaluate each method’s ability to interpret and realize complex scene compositions.

Our method consistently outperforms competitors by accurately capturing both the object counts and their spatial arrangements. For example, when prompted to generate two apples positioned farther away and four apples closer to the viewpoint, LayerCraft faithfully reproduces the specified quantity and spatial layout. In contrast, while Stable Diffusion 3.5 and FLUX.1-Dev produce the correct number of apples, they fail to preserve the intended spatial relationships. GPT-4o also struggles with correct object counting, and models like PixArt-$\alpha$ and DALL·E 3 frequently generate incorrect object counts. Furthermore, FLUX.1-Schnell and Show-o exhibit notable errors across multiple dimensions, including color, positioning, and object consistency.

Additional visual comparisons are included in the supplementary materials. Figure 5 demonstrates LayerCraft’s effectiveness in editing collage photos via a single prompt. Compared to GPT-4o, our framework delivers superior consistency in maintaining coherent backgrounds and faithful human face details, as also illustrated in Figure 6. We further evaluate an ablation without the Object Integration Network (OIN), which forgoes intermediate reference images and results in inconsistent clothing details. Even when using a manually "hacked" version of GenArtist with ground truth bounding boxes and intermediate prompts, the output suffers from blurry faces and inconsistent attire, highlighting the critical role of our intermediate representations and integrated refinement process.

Figure 6: Another example on indoor decoration, which demonstrates our model’s strong consistency.

Overall, these results highlight LayerCraft’s strengths in robust multi-object control, spatial coherence, and consistent detail preservation, which collectively set it apart from prior approaches.

4.2 Comparison on T2I-Compbench

We evaluate LayerCraft on T2I-Compbench huang2023t2i against two categories of state-of-the-art approaches: multi-agent systems (upper part of Table 1) and generic models (lower part). We use T2I-Compbench because GenEval does not report statistics for agent-based models; our GenEval results are included in the supplementary materials. As shown in Table 1, LayerCraft achieves the best score on every metric across attribute binding, object relationship, and numeracy, which we attribute to its instance-level control capabilities.

In contrast to agent-based generation approaches, which typically employ a “generate-then-edit” pipeline, LayerCraft generates each object sequentially under explicit positional and relational constraints. The generate-then-edit paradigm can propagate early errors into later stages, producing visible artifacts in the final output (see supplementary materials for illustrative failure cases). Generic diffusion and transformer models fare even worse: lacking the ability to reason over complex textual instructions, they systematically underperform our framework across all evaluated dimensions.

Table 1: Comparison with other methods on T2I-CompBench [14]. The ↑ symbol denotes that higher values correspond to better performance. Our LayerCraft system achieves state-of-the-art performance on the benchmark.

| Method | Attribute Binding: Color ↑ | Attribute Binding: Shape ↑ | Attribute Binding: Texture ↑ | Object Relationship: Spatial ↑ | Object Relationship: Non-Spatial ↑ | Numeracy ↑ |
|---|---|---|---|---|---|---|
| LayoutGPT [10] | 0.2921 | 0.3716 | 0.3310 | 0.1153 | 0.2989 | 0.4193 |
| Attn-Exct [5] | 0.6400 | 0.4517 | 0.5963 | 0.1455 | 0.3109 | - |
| GORS [14] | 0.6603 | 0.4785 | 0.6287 | 0.1815 | 0.3193 | - |
| RPG-Diffusion [47] | 0.6024 | 0.4597 | 0.5326 | 0.2115 | 0.3104 | 0.4968 |
| CompAgent [53] | 0.7400 | 0.6305 | 0.7102 | 0.3698 | 0.3104 | - |
| GenArtist [40] | 0.8482 | 0.6948 | 0.7709 | 0.5437 | 0.3346 | - |
| SDXL [27] | 0.6369 | 0.5408 | 0.5637 | 0.2032 | 0.3110 | 0.5145 |
| PixArt-α [7] | 0.6886 | 0.5582 | 0.7044 | 0.2082 | 0.3179 | 0.5001 |
| Playground v2.5 [18] | 0.6381 | 0.4790 | 0.6297 | 0.2062 | 0.3108 | 0.5329 |
| Hunyuan-DiT [20] | 0.6342 | 0.4641 | 0.5328 | 0.2337 | 0.3063 | 0.5153 |
| DALL-E 3 [3] | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3003 | - |
| SD v3 [9] | 0.8085 | 0.5793 | 0.7317 | 0.3144 | 0.3131 | 0.6088 |
| FLUX.1-Dev [17] | 0.7407 | 0.5718 | 0.6922 | 0.2863 | 0.3127 | 0.5872 |
| LayerCraft (Ours) | 0.8643 | 0.7046 | 0.8147 | 0.6432 | 0.3508 | 0.6331 |

5 Ablation Study

Ablation on CoT Variants To rigorously assess the contribution of Chain-of-Thought (CoT) reasoning in our layout generation process, we conducted a comprehensive ablation study by comparing the full LayerCraft pipeline with systematically simplified variants. The purpose is to quantify the individual and collective impact of key CoT components on generation quality and spatial coherence.

Specifically, we evaluated the following variants:

  • Without Generation Order: Removes the CoT-driven ordering mechanism used to determine the sequence of object placement.

  • Without Object Relationship: Omits relational reasoning such as spatial prepositions or inter-object dependencies.

  • Without Both Order and Relationship: Disables both sequential placement and object relationship modeling.

  • Without All CoT for Layout Generation: Fully removes CoT reasoning from the ChainArchitect, falling back to a single-pass layout prediction that does not model relationships with the background.

Due to computational constraints, we employed a stratified sampling strategy and evaluated the models on 20% of the test data, ensuring balanced representation across object types and scene configurations.
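The stratified sampling step can be sketched as follows. This is an illustrative example only, not the evaluation script used for the paper: the stratum key (object type and scene type), the toy prompt records, and the function name `stratified_sample` are assumptions introduced here.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction=0.2, seed=0):
    """Draw roughly `fraction` of `items` from every stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for group in strata.values():
        # Keep at least one item per stratum so no category disappears.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy test set: 3 hypothetical object types x 2 scene types x 10 prompts each.
prompts = []
for obj in ("animal", "vehicle", "furniture"):
    for scene in ("indoor", "outdoor"):
        for _ in range(10):
            prompts.append({"id": len(prompts), "object_type": obj, "scene": scene})

subset = stratified_sample(prompts, key=lambda p: (p["object_type"], p["scene"]))
```

Sampling within each stratum, rather than over the pooled test set, keeps the proportions of object types and scene configurations in the 20% subset equal to those of the full data.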

As shown in Table 2, the complete LayerCraft pipeline outperforms all ablated versions on every reported metric: attribute binding, object relationship, and numeracy. Notably, removing generation order or relationship reasoning degrades spatial coherence, lowering the spatial score from 0.6432 to 0.4210 and 0.4062, respectively. Fully removing CoT causes the largest drop on five of the six metrics (all except non-spatial relationship), underscoring the critical role of iterative reasoning in managing compositional complexity.

These results provide strong empirical evidence that LayerCraft’s CoT-driven layout planning is essential for achieving precise multi-object control, structured scene decomposition, and robust generalization across diverse prompts.

Table 2: Ablation study for CoT on T2I-CompBench.

| Method | Attribute Binding: Color ↑ | Attribute Binding: Shape ↑ | Attribute Binding: Texture ↑ | Object Relationship: Spatial ↑ | Object Relationship: Non-Spatial ↑ | Numeracy ↑ |
|---|---|---|---|---|---|---|
| LayerCraft | 0.8643 | 0.7046 | 0.8147 | 0.6432 | 0.3508 | 0.6331 |
| w/o generation order | 0.8524 | 0.6792 | 0.7853 | 0.4210 | 0.3147 | 0.6305 |
| w/o object relationship | 0.8512 | 0.6867 | 0.7842 | 0.4062 | 0.2854 | 0.6301 |
| w/o order & relationship | 0.8413 | 0.6463 | 0.7531 | 0.3847 | 0.2752 | 0.6023 |
| w/o CoT for Layout Generation | 0.6394 | 0.5639 | 0.7216 | 0.2831 | 0.3013 | 0.5663 |
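To make the size of each ablation effect explicit, the per-metric drops relative to the full pipeline can be computed directly from the Table 2 scores. The snippet below is a small sketch for that arithmetic; the scores are copied from Table 2, while the helper names `drops` and `worst_variant_per_metric` are introduced here for illustration.

```python
METRICS = ["color", "shape", "texture", "spatial", "non_spatial", "numeracy"]

# Scores copied from Table 2, in the order given by METRICS.
TABLE2 = {
    "LayerCraft": [0.8643, 0.7046, 0.8147, 0.6432, 0.3508, 0.6331],
    "w/o generation order": [0.8524, 0.6792, 0.7853, 0.4210, 0.3147, 0.6305],
    "w/o object relationship": [0.8512, 0.6867, 0.7842, 0.4062, 0.2854, 0.6301],
    "w/o order & relationship": [0.8413, 0.6463, 0.7531, 0.3847, 0.2752, 0.6023],
    "w/o CoT for Layout Generation": [0.6394, 0.5639, 0.7216, 0.2831, 0.3013, 0.5663],
}


def drops(table, baseline="LayerCraft"):
    """Absolute score drop of every ablated variant relative to the baseline."""
    full = table[baseline]
    return {
        name: [round(f - s, 4) for f, s in zip(full, scores)]
        for name, scores in table.items()
        if name != baseline
    }


def worst_variant_per_metric(table, baseline="LayerCraft"):
    """For each metric, the ablated variant with the largest drop."""
    d = drops(table, baseline)
    return {
        metric: max(d, key=lambda name: d[name][i])
        for i, metric in enumerate(METRICS)
    }


ablation_drops = drops(TABLE2)
worst = worst_variant_per_metric(TABLE2)
for metric in METRICS:
    print(f"{metric:12s} largest drop: {worst[metric]}")
```

On these numbers, removing all CoT reasoning gives the largest drop on every metric except non-spatial relationship, where removing both order and relationship reasoning hurts most.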

Limitations While LayerCraft delivers strong spatial control and compositional accuracy, its use of Chain-of-Thought reasoning and multi-agent coordination introduces additional computational overhead. This can impact efficiency, particularly for complex scenes with many interacting objects. Although spatial accuracy is enhanced through background-guided bounding boxes, the primary cost lies in maintaining agent interactions. Future work will focus on streamlining these processes to improve runtime performance while preserving generation quality.

6 Conclusion

We have presented LayerCraft, a novel agent-based framework for text-to-image (T2I) generation that addresses key challenges in compositional control, spatial reasoning, and multi-object fidelity. By integrating three specialized agents, namely the LayerCraft Coordinator, ChainArchitect, and the Object Integration Network (OIN), our system supports structured planning, iterative reasoning, and object-aware image refinement in a fully automated pipeline.

LayerCraft excels in generating complex scenes with accurate spatial layouts and consistent object attributes, all without requiring model finetuning. It also enables consistent multi-image editing, making it particularly effective for tasks such as photo collage editing from a single prompt. Extensive experiments demonstrate superior performance over existing methods in both accuracy and visual coherence.

With instance-level control, real-time interactivity, and a modular design, LayerCraft offers a scalable and user-friendly solution for high-quality image synthesis across a wide range of creative and practical applications.

References

  • [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • [3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
  • [4] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024.
  • [5] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  • [6] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.
  • [7] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.
  • [8] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
  • [9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  • [10] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023.
  • [11] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • [12] Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, and Peter Wonka. LLM blueprint: Enabling text-to-image generation with complex and detailed prompts. In The Twelfth International Conference on Learning Representations, 2024.
  • [13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [14] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
  • [15] Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu. Dense text-to-image generation with attention modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7701–7711, 2023.
  • [16] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
  • [17] Black Forest Labs. Flux.1 [dev]. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024. A 12 billion parameter text-to-image model available under a non-commercial license.
  • [18] Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.
  • [19] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
  • [20] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
  • [21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
  • [22] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
  • [23] Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025.
  • [24] Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. arXiv preprint arXiv:2401.12863, 2024.
  • [25] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [26] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  • [27] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [28] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • [29] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  • [30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  • [31] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • [32] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.
  • [33] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • [34] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
  • [35] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 3, 2024.
  • [36] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [37] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
  • [38] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, Steven Liu, William Berman, Yiyi Xu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers.
  • [39] Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. Instancediffusion: Instance-level control for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6232–6242, 2024.
  • [40] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing. Advances in Neural Information Processing Systems, 37:128374–128395, 2025.
  • [41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • [42] Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Motion-agent: A conversational framework for human motion generation with llms. arXiv preprint arXiv:2405.17013, 2024.
  • [43] Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7452–7461, 2023.
  • [44] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.
  • [45] Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems, 36, 2024.
  • [46] Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, and Yuexian Zou. Worldgpt: a sora-inspired video ai agent as rich world models from text and image inputs. arXiv preprint arXiv:2403.07944, 2024.
  • [47] Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In International Conference on Machine Learning, 2024.
  • [48] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  • [49] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [50] Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. arXiv preprint arXiv:2503.07027, 2025.
  • [51] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. In International Conference on Learning Representations, 2023.
  • [52] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
  • [53] Wang Zhenyu, Xie Enze, Li Aoxue, Wang Zhongdao, Liu Xihui, and Li Zhenguo. Divide and conquer: Language models can plan and self-correct for compositional text-to-image generation. arXiv preprint arXiv:2401.15688, 2024.
  • [54] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

7 Broader Impact Statement

LayerCraft significantly advances text-to-image (T2I) generation by providing precise control over composition and object integration, driven by Chain-of-Thought (CoT) reasoning. This research promises substantial positive societal impact, primarily by democratizing visual content creation for non-experts, making sophisticated design tools accessible to a broader audience. It will revolutionize creative and professional workflows in industries like advertising and gaming, drastically accelerating content creation and fostering innovation. This capability also catalyzes new forms of digital storytelling and education, enabling richer visual narratives. While acknowledging risks like misuse for misinformation or bias propagation, which we condemn and will address through ethical guidelines and further research, our core focus remains on LayerCraft’s transformative power to empower human creativity and broadly benefit society.

8 More Examples on Batch Collage Editing

Figure 7: An example of batch collage image editing. LayerCraft effectively proposes bounding boxes for necklace placement and generates a consistent reference image, leading to seamless integration across multiple images with a single prompt. In contrast, GPT-4o fails to preserve facial identity and generates inconsistent necklaces as highlighted by the red boxes.
Figure 8: Additional user scenarios. The upper panel is a larger, clearer version of the teaser image. The lower panel demonstrates an outfit modification, in which a black man’s attire is seamlessly changed to a white blazer. These examples highlight the robust capabilities of our model.
Figure 9: Further examples of batch collage image editing and generation. It illustrates the generation of a consistent Audi advertisement featuring a single car across five distinct scenes.

In this section, we present additional examples of batch collage image editing. Figure 7 illustrates LayerCraft’s ability to seamlessly integrate a Van Cleef necklace across multiple photos of a girl. Our model first intelligently identifies optimal placement bounding boxes for the necklace, then generates a consistent reference image to ensure uniformity throughout the process before engaging the Object Integration Network (OIN) for the final result. In contrast, GPT-4o struggles with this task, failing to preserve facial identity and generating inconsistent necklaces, as highlighted by the red boxes. Figure 8 provides further demonstrations of our model’s robust capabilities. The upper panel shows the zoomed-in version of the teaser image. The lower panel showcases a striking outfit modification, seamlessly changing a black man’s attire to a white blazer. Figure 9 effectively illustrates the generation of a cohesive Audi advertisement, featuring a single car consistently integrated across five distinct scenes.

9 Additional Comparisons on T2I Generation with Other SOTA Methods

In this section, Figure 15 presents a detailed qualitative comparison with state-of-the-art methods, including expanded versions of examples from the main paper for clearer visualization. As demonstrated, our model consistently yields better results, particularly in terms of object numeracy and accurate spatial relationships. Furthermore, LayerCraft exhibits significantly fewer artifacts compared to other agent-based methods. For instance, LLM Blueprint generates an anomalous red object beneath the table in the hot dog example. GenArtist, even in its teaser image, struggles with perspective accuracy: while the hotdogs are in focus, the distant car and bike remain sharply defined despite the blurry far end of the table, diminishing overall realism. Our method, conversely, avoids such inconsistencies, producing more coherent and realistic compositions.

Figure 10: Failure case of direct attention mixing, discussed in Section 10.1 (analysis of the Object Integration Network). The background is altered and the boundary of the masked region is clearly visible.

10 Additional Analysis on Object Integration Network

10.1 Ablation on Attention Mixing

In our work, attention outputs are obtained by blending dual attention maps using mask indices in the latent space. Specifically, our approach computes attention outputs independently for each branch and then merges the hidden states according to a latent mask derived from the original masked region. This integration allows the model to learn object placement within the background while preserving background integrity. Furthermore, since the multi-modal attention mechanism within the FLUX architecture processes textual and image tokens concurrently, we account for cross-modal correlations by taking a weighted average of the textual tokens from both branches. This dual-branch integration improves the model’s comprehension of the conditional inputs.

To validate our approach, we compared it against two alternatives. The first, inspired by OminiControl [35], extends the input sequence and computes the attention matrix over the entire augmented sequence. However, this variant encountered significant convergence difficulties during optimization. The second applies a weighted summation of the attention outputs, which produced artifacts with pronounced boundaries around the masked regions and unintended modifications to background elements. Figure 10 shows these failure cases alongside the successful results obtained with our method. Our approach outperforms both alternatives, as further illustrated by the additional OIN results in Figures 13 and 14.
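The mask-based merging described above can be illustrated with a minimal sketch. It is not the OIN implementation: plain Python lists stand in for tensors, and the function name `mix_branch_outputs`, the branch labels A and B, the choice that branch B fills the masked region, and the `text_weight` parameter are all assumptions made for illustration.

```python
def mix_branch_outputs(img_a, img_b, latent_mask, txt_a, txt_b, text_weight=0.5):
    """Merge the per-token attention outputs of two branches.

    img_a, img_b: image-token hidden states of branch A and B (lists of vectors).
    latent_mask:  one bool per image token, True inside the masked region.
    txt_a, txt_b: text-token hidden states of branch A and B.
    text_weight:  hypothetical weight given to branch B's text tokens.
    """
    if not (len(img_a) == len(img_b) == len(latent_mask)):
        raise ValueError("image tokens and latent mask must have the same length")
    if len(txt_a) != len(txt_b):
        raise ValueError("both branches must have the same number of text tokens")

    # Image tokens: select by the latent mask (assumption: branch B inside the
    # masked region, branch A outside), so background tokens are left untouched.
    mixed_img = [
        list(b) if inside else list(a)
        for a, b, inside in zip(img_a, img_b, latent_mask)
    ]

    # Text tokens: weighted average of the two branches.
    mixed_txt = [
        [(1.0 - text_weight) * x + text_weight * y for x, y in zip(ta, tb)]
        for ta, tb in zip(txt_a, txt_b)
    ]
    return mixed_img, mixed_txt


# Three image tokens (only the middle one is masked) and one text token.
img_a = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
img_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
mask = [False, True, False]
mixed_img, mixed_txt = mix_branch_outputs(
    img_a, img_b, mask, txt_a=[[2.0, 4.0]], txt_b=[[4.0, 8.0]]
)
```

Because image tokens are selected rather than averaged, tokens outside the mask come from a single branch only. This differs from the weighted summation of attention outputs that produced visible boundaries and background changes in the ablation.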

10.2 Comparisons with Concurrent Subject-driven Inpainting Methods

Figure 11: Visual comparisons with concurrent works on the subject-driven inpainting task. The first row is our Object Integration Network, the second row is ACE++ [23], and the last row is EasyControl [50].

In addition, we provide qualitative comparisons with concurrent approaches that support subject-driven inpainting [23, 50], to show that our contribution is on par with, or even surpasses, the current state of the art. Figure 11 displays these results. A detailed examination reveals that our Object Integration Network (OIN) captures finer details. For example, on the orange bottle the text and logos are well preserved by OIN, whereas ACE++ blurs the text and EasyControl produces a reverted logo. Another instance is the violin, whose features (e.g., the chin rest) are maintained only by OIN and disappear in the results from ACE++ and EasyControl. Similarly, the lighted earring on the lady vanishes in EasyControl’s output. While these concurrent methods could in principle replace OIN within our pipeline, OIN is more efficient because the models need to be loaded and unloaded only once; using it only requires loading the corresponding LoRA weights. In terms of speed, OIN and ACE++ generate results within one minute, depending on the GPU type, whereas EasyControl is two to three times slower, further underscoring OIN’s contribution to the overall pipeline.

Figure 12: Iterative refinement of bounding box proposals in editing tasks. The LLM (ChainArchitect) first proposes a bounding box. The proposal is then drawn directly on the image, allowing the LLM to iteratively refine its size and position. This refinement process proves highly effective in achieving appropriate bounding box localization.

11 Handling Difficult Bounding Box Proposals

Proposing a correct bounding box using only an LLM can be difficult, so we adopt an iterative refinement process, illustrated in Figure 12. ChainArchitect first proposes an initial bounding box. This proposal is then drawn directly on the image, allowing the LLM to iteratively refine its size and position. This refinement process proves highly effective in achieving appropriate bounding box localization.
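The refinement loop can be sketched as below. This is a schematic under stated assumptions rather than the actual ChainArchitect code: the `critique` callback stands in for the step that draws the box on the image and asks the LLM for a correction, and `refine_box`, `clamp_box`, `max_rounds`, and the toy critic are hypothetical names introduced here.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image and keep its corners ordered."""
    x0, y0, x1, y1 = box
    x0, x1 = sorted((max(0, min(x0, width)), max(0, min(x1, width))))
    y0, y1 = sorted((max(0, min(y0, height)), max(0, min(y1, height))))
    return (x0, y0, x1, y1)


def refine_box(
    initial: Box,
    critique: Callable[[Box], Optional[Box]],
    width: int,
    height: int,
    max_rounds: int = 5,
) -> Box:
    """Refine a box until `critique` accepts it or the round budget runs out.

    `critique(box)` is a stand-in for visualizing the box on the image and
    querying the LLM: it returns a corrected box, or None if the box is fine.
    """
    box = clamp_box(initial, width, height)
    for _ in range(max_rounds):
        suggestion = critique(box)
        if suggestion is None:
            break
        box = clamp_box(suggestion, width, height)
    return box


def toy_critic(box: Box) -> Optional[Box]:
    """Toy replacement for the LLM: shrink boxes wider than 100 pixels."""
    x0, y0, x1, y1 = box
    if x1 - x0 <= 100:
        return None
    return (x0 + 20, y0, x1 - 20, y1)


refined = refine_box((50, 50, 250, 150), toy_critic, width=512, height=512)
```

Bounding the number of rounds keeps the cost of the extra LLM calls predictable, and clamping after every suggestion guards against proposals that fall outside the image.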

12 Overall Process Decomposed Demo

In this section, we demonstrate how the process unfolds within the framework. We use a living room decoration example to illustrate the framework’s CoT thinking process; the full trace is shown at the end of the supplementary materials.

Figure 13: More results of OIN.
Figure 14: More results of OIN
Figure 15: More comparisons with state-of-the-art models. We also include the examples used in the main paper, enlarged to show more detail.
User: I want to decorate this room.

LayerCraft (CoT thinking process): This image shows a minimalist, modern living room with a clean, warm aesthetic. Let me identify the objects and describe the background in JSON format.

{
  "objects": [
    {
      "name": "reading nook",
      "description": "A built-in wooden seating area with dark cushions and white pillows nestled within the wooden shelving unit"
    },
    {
      "name": "pendant light",
      "description": "A circular, disc-shaped hanging light fixture with a diffuse glow suspended from the ceiling"
    },
    {
      "name": "area rug",
      "description": "A large, neutral-toned rectangular rug covering part of the wooden floor"
    },
    {
      "name": "shelving unit",
      "description": "Built-in light wood shelving spanning the back wall with various decorative items displayed"
    },
    {
      "name": "decorative objects",
      "description": "Small sculptures, books, and framed artwork arranged sparsely on the shelves"
    },
    {
      "name": "sliding glass doors",
      "description": "Large floor-to-ceiling glass doors/windows on the left side offering views to the outdoors"
    }
  ],
  "background": {
    "description": "A spacious, minimalist room with light wooden flooring, white ceiling, and light wood wall paneling. The room features built-in shelving along the back wall and large sliding glass doors that allow natural light to flood the space. The viewpoint is from the center of the room looking toward the back wall with the built-in shelving and seating area, with the glass doors visible on the left side.",
    "included_elements": ["wooden flooring", "white ceiling", "light wood wall paneling", "natural lighting"]
  }
}

The shelving unit and built-in seating are considered separate objects but could be viewed as part of the architectural design. I’ve included the structural elements like flooring, ceiling, and wall paneling in the background description since they define the space rather than being foreground objects.

Looking at the image, I can suggest several items that could be added to enhance the minimalist space while maintaining its clean aesthetic. Let me think about what would complement this room:

{
  "suggested_additions": [
    {
      "name": "coffee table",
      "description": "A low, rectangular wooden coffee table with clean lines, placed in front of the reading nook on the area rug"
    },
    {
      "name": "side table",
      "description": "A small, round side table in light wood or brass next to the reading nook"
    },
    {
      "name": "floor lamp",
      "description": "A slim, architectural floor lamp with warm lighting placed in the corner opposite the glass doors"
    },
    {
      "name": "accent chair",
      "description": "A minimalist armchair in a complementary neutral tone (tan leather or light gray fabric) positioned at an angle to the reading nook"
    },
    {
      "name": "throw blanket",
      "description": "A textured throw blanket in a subtle pattern or muted color draped over one corner of the reading nook"
    },
    {
      "name": "indoor plant",
      "description": "A tall potted plant like a fiddle leaf fig or snake plant placed near the glass doors to bring in natural elements"
    },
    {
      "name": "decorative bowl",
      "description": "A sculptural bowl or vessel on one of the open shelves to add visual interest"
    }
  ]
}

These additions would enhance the functionality and warmth of the space while respecting its minimalist design language and natural materials.

Analyzing the image based on the provided grid regions, I’ll determine suitable placement areas for the suggested objects while considering perspective and available space:

{
  "object_placements": [
    {
      "object": "coffee table",
      "suitable_regions": ["center", "between center and center-right", "between lower center and center", "between lower center-right and center"],
      "reasoning": "The central floor area on the rug provides ample space for a coffee table, positioned in front of the reading nook"
    },
    {
      "object": "side table",
      "suitable_regions": ["center-right", "between center and center-right"],
      "reasoning": "A small side table would fit well beside the reading nook on the right side"
    },
    {
      "object": "floor lamp",
      "suitable_regions": ["lower left", "between lower left and center-left"],
      "reasoning": "The corner area opposite the glass doors provides good placement for a tall floor lamp without blocking views or pathways"
    },
    {
      "object": "accent chair",
      "suitable_regions": ["lower center-left", "between lower center-left and center", "center-left", "between center-left and center"],
      "reasoning": "An accent chair would fit well in this area, creating a conversation zone with the reading nook while maintaining open space"
    },
    {
      "object": "throw blanket",
      "suitable_regions": ["center-right"],
      "reasoning": "The throw blanket would be placed on the reading nook which is already in this region"
    },
    {
      "object": "indoor plant",
      "suitable_regions": ["upper left", "between upper left and center-left"],
      "reasoning": "Near the glass doors to receive natural light while adding greenery to that corner of the room"
    },
    {
      "object": "wall art",
      "suitable_regions": ["upper center-left", "upper center", "between upper center-left and center", "between upper center and center"],
      "reasoning": "The visible wall space opposite the shelving unit would accommodate wall art while maintaining visual balance"
    },
    {
      "object": "decorative bowl",
      "suitable_regions": ["center-right", "between center and center-right"],
      "reasoning": "Could be placed on one of the existing shelves in the built-in unit"
    }
  ],
  "spatial_considerations": {
    "viewing_perspective": "The image is taken from a position looking toward the back wall with the built-in shelving, with the glass doors on the left side",
    "floor_space": "The central and lower areas have the most available floor space for furniture placement",
    "traffic_flow": "Pathways should be maintained between the entrance (presumed to be behind the viewing position) and the glass doors",
    "focal_points": "The reading nook and shelving unit already serve as focal points, so additional elements should complement rather than compete with them"
  }
}

This analysis considers both the physical space available in each region and the design principles of balance, flow, and functionality while maintaining the minimalist aesthetic of the room.

Based on our analysis, here are the bounding boxes and prompts for each suggested object, ordered from far to close in terms of generation order:

{
"objects": [
{
"type": "indoor plant",
"position": "upper left",
"generation_order": 1,
"prompt": "A tall, elegant fiddle leaf fig plant with large, glossy green leaves in a minimalist ceramic pot, placed near large windows to receive natural light, rendered in a clean, contemporary style that complements modern interior design.",
"bounding_box": [50, 150, 150, 350]
},
{
"type": "decorative bowl",
"position": "center-right",
"generation_order": 2,
"prompt": "A sculptural, handcrafted ceramic bowl in a soft matte finish with subtle organic texture and asymmetrical form, placed on an open shelf, rendered in a minimalist Scandinavian style.",
"bounding_box": [420, 220, 470, 260]
},
{
"type": "throw blanket",
"position": "center-right",
"generation_order": 3,
"prompt": "A soft, textured throw blanket in muted beige with subtle geometric pattern, casually draped over the corner of the built-in seating area, rendered in a warm, inviting style that adds comfort to the minimalist space.",
"bounding_box": [420, 250, 500, 300]
},
{
"type": "side table",
"position": "between center and center-right",
"generation_order": 4,
"prompt": "A small, round side table with slender brass legs and a light oak top, positioned beside the reading nook at perfect arm’s reach height, rendered in a clean, architectural style.",
"bounding_box": [350, 250, 420, 320]
},
{
"type": "floor lamp",
"position": "between lower left and center-left",
"generation_order": 5,
"prompt": "A tall, architectural floor lamp with a slender brushed brass stem and minimal white shade casting a warm glow, positioned in the corner of the room, rendered in a contemporary style that emphasizes clean lines.",
"bounding_box": [70, 320, 150, 450]
},
{
"type": "accent chair",
"position": "center-left to between center-left and center",
"generation_order": 6,
"prompt": "A minimalist lounge chair with gentle curves, upholstered in natural tan leather with a light wooden frame, positioned at an inviting angle to create conversation space, rendered in a Scandinavian modern style.",
"bounding_box": [100, 250, 250, 380]
},
{
"type": "coffee table",
"position": "center to between lower center and center",
"generation_order": 7,
"prompt": "A low, rectangular coffee table with clean lines and rounded corners, crafted from light oak with a subtle grain pattern and minimalist design, positioned centrally on the area rug, rendered in a contemporary style that balances form and function.",
"bounding_box": [200, 300, 350, 400]
}
]
}

This arrangement considers depth perception (generating background objects first), spatial relationships, and the existing elements in the room. The bounding boxes are designed to place objects in appropriate locations while maintaining a balanced, functional layout that enhances the minimalist aesthetic of the space.
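
The far-to-near ordering and the bounding boxes in the layout above can be checked mechanically before any object is generated. The following is a minimal sketch, not LayerCraft's implementation. It assumes that each box is `[x_min, y_min, x_max, y_max]` in pixels and that the canvas is 512×512; neither is stated in the transcript. The helper names `validate_box` and `generation_plan` are hypothetical, and the input is a trimmed copy of the layout with the prompts omitted and the entries deliberately listed out of order.

```python
import json

# Assumption: the transcript does not state the canvas size; 512x512 is a guess.
CANVAS_W, CANVAS_H = 512, 512


def validate_box(box, width=CANVAS_W, height=CANVAS_H):
    """Return True if box is a well-formed [x_min, y_min, x_max, y_max] inside the canvas."""
    if len(box) != 4:
        return False
    x0, y0, x1, y1 = box
    return 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height


def generation_plan(layout_json):
    """Parse a layout and return (type, bounding_box) pairs sorted far-to-near."""
    layout = json.loads(layout_json)
    # Lower generation_order means farther from the camera, so it is drawn first.
    objects = sorted(layout["objects"], key=lambda o: o["generation_order"])
    for obj in objects:
        if not validate_box(obj["bounding_box"]):
            raise ValueError(f"invalid bounding box for {obj['type']}: {obj['bounding_box']}")
    return [(obj["type"], obj["bounding_box"]) for obj in objects]


# Trimmed copy of the layout above (prompts omitted, entries shuffled).
example = """
{
  "objects": [
    {"type": "coffee table", "generation_order": 7, "bounding_box": [200, 300, 350, 400]},
    {"type": "indoor plant", "generation_order": 1, "bounding_box": [50, 150, 150, 350]},
    {"type": "side table", "generation_order": 4, "bounding_box": [350, 250, 420, 320]}
  ]
}
"""

plan = generation_plan(example)
for name, box in plan:
    print(name, box)
```

Sorting on `generation_order` rather than trusting the list order keeps the plan correct even if the LLM emits the objects in a different sequence.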

User: I’ve done the previous decorating and want to put this cute teddy bear in.

LayerCraft: Looking at the new item you want to add to the room, I can help place the teddy bear in the previously discussed space.

{
"additional_object": [
{
"type": "teddy bear",
"position": "lower center-right",
"generation_order": 8,
"prompt": "A small, golden-brown teddy bear with a smiling face and soft plush texture, sitting upright on one of the built-in shelves, adding a touch of warmth and playfulness to the minimalist space.",
"bounding_box": [290, 300, 480, 490]
}
]
}

These placements would work well with your previous decorating plans.
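
When an object is added after the initial layout, it can be useful to check how its box interacts with the boxes already placed. The sketch below is an illustration only and does not describe how LayerCraft resolves overlaps. It assumes the boxes are `[x_min, y_min, x_max, y_max]` in pixels, which the transcript does not state, and the function names are hypothetical. The box values are copied from the two layouts above.

```python
def intersection_area(a, b):
    """Overlap area of two [x_min, y_min, x_max, y_max] boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def box_area(box):
    """Area of a single [x_min, y_min, x_max, y_max] box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


# Boxes copied from the earlier layout in this transcript.
existing = {
    "throw blanket": [420, 250, 500, 300],
    "side table": [350, 250, 420, 320],
    "floor lamp": [70, 320, 150, 450],
    "accent chair": [100, 250, 250, 380],
    "coffee table": [200, 300, 350, 400],
}
teddy_bear = [290, 300, 480, 490]

# Overlap, in square pixels, between the new object and each existing one.
overlaps = {name: intersection_area(teddy_bear, box) for name, box in existing.items()}
for name, area in overlaps.items():
    print(f"{name}: overlap={area}, iou={iou(teddy_bear, existing[name]):.3f}")
```

Under this reading of the coordinates, the teddy bear box overlaps the coffee table and side table boxes and is disjoint from the other three. Because the bear has the highest `generation_order`, it would be drawn last, in front of the objects it overlaps.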