SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Intermediate
Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat et al. 2/26/2026
arXiv

Key Summary

  • SeeThrough3D teaches image generators to understand what should be visible and what should be hidden when objects overlap, just like in real life.
  • It invents OSCR, a way to show scenes as translucent, color-coded 3D boxes rendered from a camera, so the model can reason about occlusion and orientation.
  • These rendered OSCR images are turned into visual tokens and fed into a powerful text-to-image transformer (FLUX/DiT) as extra hints.
  • A clever attention mask links each box area to the correct noun in the prompt (like 'dog' or 'chair'), stopping attributes from mixing between objects.
  • A synthetic Blender dataset with strong occlusions plus realistic depth-based augmentations trains the system to handle tricky overlaps.
  • On a new benchmark (3DOcBench), SeeThrough3D beats prior methods on layout adherence, occlusion correctness, orientation accuracy, and image quality.
  • It also lets you choose the camera viewpoint, so you can rotate the scene or lower the camera and still keep occlusions right.
  • Personalization ties a reference object image to a selected 3D box, so specific branded or custom items can be placed in 3D with correct visibility.
  • User studies show people strongly prefer SeeThrough3D’s realism, layout following, and prompt following over baselines.
  • Limitations include relying on the base generator’s skills (FLUX) and extra memory for multi-object personalization.

Why This Research Matters

Realistic occlusion is central to believable images: our brains notice instantly when overlaps are wrong. SeeThrough3D gives artists, designers, and developers fine-grained 3D control, so objects appear at the right size, place, and depth from any chosen camera. This helps architectural previews, product mockups, and game concept art look more accurate without manual Photoshop fixes. It also encourages safe, reproducible pipelines: a layout plan plus a camera directly shape the result, which is vital for design reviews and client approvals. By keeping the base model’s knowledge intact, it generalizes well beyond the synthetic training set. In short, it turns messy 2D hints into a clean 3D-aware signal that modern transformers can follow reliably.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re arranging toys on a table and taking a photo. If a toy car is in front of a teddy bear, part of the bear is hidden. Your eyes know exactly which parts you can and can’t see. Computers that make pictures from text need to learn this too.

🥬 The Concept (Occlusion Reasoning): What it is: Occlusion reasoning means understanding which parts of objects get hidden behind other objects when seen from a camera. How it works (step by step):

  1. Picture a 3D scene with objects at different depths.

  2. Pick a camera view (where you stand and look).

  3. For each pixel in the view, decide which object is closest along that line.

  4. Show only the nearest object there; hide the others.

Why it matters: Without occlusion reasoning, a model might draw a dog as if it floats through a chair or make a bike appear both in front of and behind a car at the same time, which looks wrong.

🍞 Bottom Bread (Anchor): If your prompt says, “a dog behind a chair,” the chair should block some of the dog. That’s occlusion reasoning in action.
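The per-pixel decision in steps 1–4 can be sketched as a tiny z-buffer. This is an illustrative Python sketch, not the paper's code; the scene format (axis-aligned 2D footprints, each with a single depth) is invented for the example.

```python
# Minimal z-buffer sketch of occlusion reasoning: for each pixel, keep only
# the object closest to the camera along that pixel's ray.
# Objects are hypothetical (x0, y0, x1, y1, depth) footprints.

def nearest_object(objects, width, height):
    """Return a visibility map: pixel -> id of the closest covering object."""
    visible = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for obj_id, (x0, y0, x1, y1, depth) in objects.items():
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                if depth < zbuf[y][x]:      # closer along this ray wins
                    zbuf[y][x] = depth
                    visible[y][x] = obj_id
    return visible

# A chair (depth 2.0) in front of a dog (depth 5.0), overlapping in the middle.
scene = {"dog": (2, 0, 8, 4, 5.0), "chair": (0, 2, 5, 6, 2.0)}
vis = nearest_object(scene, width=8, height=6)
```

In the overlap region the chair wins because its depth is smaller: exactly the "which object is closest along that line" rule from step 3.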

  1. The World Before: Text-to-image models became amazing painters. They could follow prompts and make pretty pictures, and some could follow simple 2D layouts (like boxes on a flat image). But scenes in real life are 3D: objects have size and depth, and they block one another depending on where the camera is. Most controls only told the model roughly where to put things in 2D, not how they stack in 3D.

  2. The Problem: When multiple objects overlap, many methods can’t keep track of who’s in front of whom. This breaks scale, perspective, and realism. A sedan should partly hide a bike standing behind it. Without true 3D and occlusion awareness, results look confusing or flat.

  3. Failed Attempts:

  • Depth maps of 3D boxes: These provide rough distance but often erase occluded objects entirely (so the model forgets they exist) and can’t pin down front-facing direction precisely.
  • 2D layer stacks: Treat each object as a flat cutout in a pile. This ignores real 3D perspective, so mistakes like wrong camera viewpoint or impossible overlaps appear.
  • Orientation-only control: Some methods twist objects to a given angle but don’t place them well in 3D or manage who hides whom.
  • Sequential insert-and-fix loops: Add one object at a time, re-editing the image. This can cause artifacts and incoherent scenes because early steps don’t "see" the final plan.

  4. The Gap: We needed a single representation that: (a) keeps occluded parts "known" (not deleted), (b) encodes orientation clearly, (c) includes the camera viewpoint, and (d) is easy for a modern generator to use.

  5. Real Stakes: Why care? Designers and architects need to place furniture exactly and set camera angles; game creators need crowds and props to overlap realistically; shoppers expect believable product mockups; and students learning art or physics need correct depth cues. Better occlusion means images that feel true to life, even for busy scenes with many objects.

02 Core Idea

🍞 Top Bread (Hook): You know how clear plastic boxes can show what’s behind them? If you color their faces differently and place them in 3D, you can still guess which way each box is facing—even when they overlap.

🥬 The Concept (OSCR): What it is: OSCR is a special picture of a 3D scene made of translucent, color-coded 3D boxes rendered from a chosen camera, so the model can see occlusions and orientations at once. How it works (step by step):

  1. Put a 3D box where each object goes, sized and rotated like the object.
  2. Color the box faces with a fixed code (front, left, right, etc.).
  3. Make boxes translucent so hidden parts are still faintly visible.
  4. Render from the camera to get a single image that carries layout, orientation, and viewpoint.

Why it matters: Without OSCR, the model either loses occluded objects (depth maps) or flattens them (2D layers). OSCR keeps the "hidden-but-present" signal, so the generator knows where everything truly is.

🍞 Bottom Bread (Anchor): If your prompt says “a bike in front of a van,” the OSCR shows two translucent boxes with the bike’s box partly covering the van’s box from the camera’s view, and color-coded faces showing which way they point.
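The translucency in step 3 is ordinary alpha compositing. The sketch below (Python, reduced to a single pixel; the colors and alpha values are made up for illustration) shows why a translucent render keeps occluded boxes "hidden but present": the far box still contributes to the final color instead of being erased.

```python
# Alpha-blend a stack of RGBA layers painted back-to-front (painter's order).
# With alpha < 1, an occluded face still tints the pixel, so the model can
# detect objects behind other objects -- unlike a hard depth map.

def composite(layers):
    """Blend (r, g, b, a) layers, farthest first; returns final (r, g, b)."""
    r = g = b = 0.0
    for (cr, cg, cb, a) in layers:              # back-to-front
        r = cr * a + r * (1 - a)
        g = cg * a + g * (1 - a)
        b = cb * a + b * (1 - a)
    return (round(r, 3), round(g, 3), round(b, 3))

# Van's front face (translucent orange, far) behind the bike's left face
# (translucent blue, near):
van_front = (1.0, 0.5, 0.0, 0.6)
bike_left = (0.0, 0.0, 1.0, 0.6)
print(composite([van_front, bike_left]))        # → (0.24, 0.12, 0.6)
```

The blue dominates (the bike is in front), but the orange channel is nonzero: the van's box remains faintly visible through the bike's box.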

The “Aha!” in one sentence: Replace flat or lossy layout hints with a single camera-rendered picture of translucent, color-coded 3D boxes, then tightly link each box area to its noun in the text so the image model draws the right thing in the right place—even when parts are hidden.

Three helpful analogies:

  1. Theater stage plan: OSCR is like a director’s stage diagram with see-through props, so everyone knows who stands where and who blocks whom from the audience’s seat (the camera).
  2. X‑ray stickers: You stick semi‑transparent, color‑coded stickers over a window view; you still see the mountain (background) through the tree (foreground), and the colors tell which way each sticker faces.
  3. Puzzle overlay: Instead of guessing where each piece goes, the overlay shows both front and behind pieces faintly, helping you place each new piece with confidence.

Before vs After:

  • Before: Models treated overlapping objects loosely; hidden parts were forgotten or layout control broke with many objects.
  • After: The generator gets a single, rich hint image encoding depth overlap, orientation, and camera view, plus an attention mask that binds each region to the right word. Result: crisp occlusion edges, correct scale, and fewer mix-ups between objects.

Why it works (intuition): Modern text-to-image transformers already carry a lot of world knowledge (like that vans are big, bikes are thin). If we give them a spatial "map" (OSCR) that preserves who’s in front and which way things face, and we force the right words to talk only to the right regions (attention masking), the model can use its prior knowledge to paint each object sharply where it belongs.

Building Blocks (each as a mini “sandwich”):

  • Attention Mechanism 🍞 Hook: When you study, you highlight the important sentences and skim the rest. 🥬 What it is: Attention lets the model focus more on important tokens (words or pixels) and less on others. How: It scores how related each token is to the current one and weighs information by those scores. Why: Without attention, the model treats everything equally and gets confused. 🍞 Anchor: To answer “What’s the capital of France?”, attention zooms in on “capital” and “France,” not filler words.

  • Visual Tokens 🍞 Hook: Think of tokens as tiny puzzle pieces of an image. 🥬 What it is: Visual tokens are compact pieces that represent parts of an image inside the model. How: An encoder (like a VAE) turns an image into a grid of tokens; the transformer mixes and matches them with text tokens. Why: Without tokens, images can’t be processed alongside text efficiently. 🍞 Anchor: The OSCR image becomes OSCR tokens, like puzzle pieces the model can assemble with the prompt.

  • Flow-based Text-to-Image (DiT/FLUX) 🍞 Hook: Imagine painting by slowly guiding random noise into a picture using gentle pushes. 🥬 What it is: A flow/diffusion transformer turns noise into an image, guided by text and extra condition tokens. How: Start with noise; at each step, use the transformer to nudge it closer to a coherent image that matches text and conditions. Why: Without these guided steps, the result would stay noisy or ignore the prompt. 🍞 Anchor: With the prompt “dog behind a chair” plus OSCR tokens, the model shapes noise into that exact scene.

  • Masked Self-Attention for Object Binding 🍞 Hook: During group projects, you talk to your own team members about your task, not to every group in the room. 🥬 What it is: A mask that restricts which tokens can attend to which others so the right words bind to the right regions. How: OSCR tokens inside a box are only allowed to attend to the matching object noun tokens in the prompt. Why: Without masks, attributes can bleed: the dog might get the chair’s color or swap places. 🍞 Anchor: The “bike” region attends to the word “bike,” so thin tubes and wheels appear where that box sits.

  • Synthetic Dataset Construction 🍞 Hook: Before the big game, teams scrimmage to practice tough plays. 🥬 What it is: Automatically built training scenes with controlled occlusions and varied camera views. How: Place 3D assets in Blender, render images and OSCR pairs, then create realistic augmentations using depth-to-image and filter bad ones. Why: Without tough, varied practice data, the model won’t learn tricky overlaps. 🍞 Anchor: Scenes like “van behind a bike with the camera low” appear often in training, so the model masters that case.

  • Personalized Objects Control 🍞 Hook: Like dressing your avatar in a favorite jersey, then placing it on the field. 🥬 What it is: Using a reference object image so that specific looks (logo, pattern) appear inside a chosen 3D box. How: Encode the reference image into appearance tokens; allow the OSCR region for that object to attend to these tokens. Why: Without this, the model can’t guarantee your exact item shows up—just something similar. 🍞 Anchor: A branded mug image is bound to the “mug” box, so that exact design appears on the mug in the scene.
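The masked self-attention building block above can be made concrete with a small sketch. The token indices and the region-to-noun mapping are hypothetical; a real implementation would derive them from the tokenizer and the projected box regions.

```python
# Sketch of an object-binding attention mask: a boolean matrix saying which
# spatial (OSCR) tokens may attend to which prompt tokens. Where boxes
# overlap, the region attends to every involved noun.

def build_binding_mask(n_text, n_spatial, region_to_nouns):
    """mask[s][t] is True iff spatial token s may attend to text token t."""
    mask = [[False] * n_text for _ in range(n_spatial)]
    for s, noun_ids in region_to_nouns.items():
        for t in noun_ids:
            mask[s][t] = True
    return mask

# Prompt: "a bike and a van" -> suppose 'bike' is token 1 and 'van' is token 4.
# Spatial token 0 lies only in the bike's box; token 1 sits in the overlap.
mask = build_binding_mask(
    n_text=5, n_spatial=2,
    region_to_nouns={0: [1], 1: [1, 4]},   # overlap attends to both nouns
)
```

Everything not marked True is blocked, so the "bike" region never reads the word "van" on its own, which is what prevents attribute bleeding.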

03 Methodology

High-level recipe: Input (3D boxes + camera + text) → Render OSCR (translucent, color-coded boxes from camera) → Encode OSCR into visual tokens → Concatenate with text and noisy image tokens → Masked self-attention binds words to boxes → Transformer refines noise into the final image.

Step-by-step details (what/why/example):

  1. Build the 3D layout
  • What: The user (or a tool) places 3D bounding boxes—size, position, rotation—for each object and sets the camera viewpoint.
  • Why: Boxes are a clean, controllable handle for layout and scale; the camera bakes in perspective and view.
  • Example: Two boxes: a small, thin one for a bike in front; a larger one for a van behind; camera low to the ground.
  2. Make it OSCR (translucent + color-coded) and render
  • What: Color the front, left, right, etc., faces with a known color code, then render all boxes as translucent.
  • Why: Translucency preserves hidden-but-present parts for occlusion reasoning; color coding encodes orientation.
  • Example: The bike’s front face may be orange; the left face blue; you can still faintly see the van’s box through the bike’s frame.
  3. Turn OSCR into tokens
  • What: A VAE encoder converts the rendered OSCR image into a grid of OSCR tokens aligned with image space positions.
  • Why: Tokens let the transformer treat OSCR hints as spatial puzzle pieces that match the output image grid.
  • Example: The top-left OSCR token corresponds to the top-left area of the final image; positions stay in sync.
  4. Pack tokens together with the text and the noisy image
  • What: Concatenate text tokens (prompt), OSCR tokens (condition), and the current noisy image tokens.
  • Why: Putting them in one sequence lets the transformer coordinate text meaning with spatial layout and visual content.
  • Example: Sequence: [“A”, “photo”, “of”, “bike”, “and”, “van”, …, OSCR tokens grid, noisy image tokens].
  5. Apply masked self-attention to bind objects to boxes
  • What: Use an attention mask so OSCR tokens inside each 3D box only attend to the matching noun tokens; intersections attend to multiple nouns if boxes overlap.
  • Why: Prevents attribute mixing and enforces that each region learns from the right word(s); intersections still carry both signals, which the model can separate.
  • Example: Pixels where the bike overlaps the van will attend to both “bike” and “van,” but the model’s learned priors keep edges sharp and avoid blending.
  6. Train a lightweight adapter (LoRA) on the OSCR parts
  • What: Insert LoRA only on projections that involve the new OSCR tokens (and appearance tokens for personalization), leaving the base model’s strong image prior intact.
  • Why: Keeps image quality and general knowledge while learning just enough to use OSCR properly.
  • Example: The model quickly learns that orange-fronted faces often mean “front” and uses that to place textures correctly.
  7. Build the dataset with heavy occlusions and realistic variety
  • What: Procedurally place diverse 3D assets in Blender, render images and their OSCR, vary cameras to enforce occlusions, then generate realistic augmentations with a depth-to-image model. Filter bad augmentations with object-level CLIP checks.
  • Why: The model must see many tough overlaps and camera angles to master occlusions and orientations.
  • Example: Keep scenes where each object is partly visible but strongly overlapped; discard scenes with no overlap or almost invisible objects.
  8. Personalization (optional)
  • What: Add appearance tokens from a reference image and allow the target OSCR region to attend to them, just like it attends to its noun.
  • Why: Ensures your specific design appears on the right object and respects occlusions and camera.
  • Example: A custom-patterned chair image is encoded and bound to the “chair” box so that exact pattern shows up.
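The token-packing step can be sketched as plain concatenation with recorded segment offsets. This is a toy sketch with string stand-ins for embeddings; the real model concatenates learned embedding vectors, and the offsets would feed the binding mask.

```python
# Toy sketch of assembling the joint conditioning sequence:
# [text tokens | OSCR tokens | noisy image tokens].

def pack_sequence(text_tokens, oscr_tokens, image_tokens):
    """One joint sequence lets self-attention mix text, layout, and pixels."""
    seq = list(text_tokens) + list(oscr_tokens) + list(image_tokens)
    # Record span offsets so an attention mask can address each segment.
    n_t, n_o = len(text_tokens), len(oscr_tokens)
    spans = {
        "text": (0, n_t),
        "oscr": (n_t, n_t + n_o),
        "image": (n_t + n_o, len(seq)),
    }
    return seq, spans

seq, spans = pack_sequence(["bike", "van"], ["o0", "o1", "o2"], ["n0", "n1"])
```

The span bookkeeping is what makes the masked self-attention step cheap: the mask only needs segment boundaries plus the box-to-noun map, not any change to the transformer itself.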

The secret sauce:

  • Translucent OSCR preserves “hidden-but-present” signals the model can learn from, avoiding the common failure where occluded objects are forgotten.
  • Color-coded faces give a clean orientation cue that the model can read in image space without guessing.
  • Attention masking is the glue that keeps words and regions matched, preventing attribute mixing even in crowded, overlapping layouts.
  • Training on purposefully occluded synthetic scenes plus realistic augmentations teaches the model the hard cases it must solve in the wild.

Mini “sandwiches” where new tools appear:

  • OSCR (already introduced in core idea) reinforces here: it’s the compact, single-image condition capturing layout, occlusion, orientation, and camera.
  • Masked Self-Attention (refresher): Like study groups, tokens only talk to relevant teammates; without it, teammates swap notes and mix up tasks.
  • Visual Tokens (refresher): Puzzle pieces that match places on the canvas; without them, the model can’t connect OSCR hints to image regions.
  • Flow-based Text-to-Image (refresher): Gentle, guided steps from noise to picture; without guidance, results ignore the plan.
  • Synthetic Dataset Construction (refresher): Scrimmage with hard plays prepares the team for the real game; without it, test-time surprises cause fumbles.

04 Experiments & Results

The test: Can SeeThrough3D place multiple objects in 3D with correct who’s-in-front, correct facing direction, and good image quality, while still following the prompt?

Benchmarks and data:

  • Training: 25K Blender renders (with strong occlusions) plus 25K realistic augmentations made via depth-to-image generation and CLIP-based filtering.
  • Evaluation: 3DOcBench (500 scenes) crafted to stress occlusions, diverse layouts, and many camera views.

The competition: Prior methods represent the baseline options people really use:

  • LooseControl: Uses depth maps of 3D boxes; simple and general, but tends to forget occluded items and struggles with complex overlaps.
  • Build-A-Scene: Adds objects in repeated generate–invert cycles; can improve adherence but risks artifacts and scene incoherence.
  • VODiff and LaRender: Use 2D layers/latent tricks to control visibility order; helpful for some overlaps, but lack true 3D camera and orientation awareness.

Metrics made friendly:

  • Depth ordering accuracy: Did the model get who’s in front of whom correct across object pairs? Higher is better.
  • Objectness/layout score: Did each object appear where intended and look like what it should? Higher is better.
  • Orientation error: How far off is each object’s facing direction from ground truth? Lower is better.
  • Text alignment (CLIP): Does the image match the prompt well? Higher is better.
  • Image quality (KID): Lower is better; think of it as smoothness and realism.
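Depth ordering accuracy reads naturally as a pairwise comparison. A minimal sketch, assuming each object reduces to a single scalar depth (the benchmark's exact protocol may differ):

```python
# Illustrative pairwise depth-ordering accuracy: the fraction of object
# pairs whose predicted front/behind relation matches ground truth.
from itertools import combinations

def depth_ordering_accuracy(pred_depths, true_depths):
    """pred_depths/true_depths: dicts mapping object name -> camera depth."""
    pairs = list(combinations(pred_depths, 2))
    if not pairs:
        return 1.0
    correct = sum(
        (pred_depths[a] < pred_depths[b]) == (true_depths[a] < true_depths[b])
        for a, b in pairs
    )
    return correct / len(pairs)

pred = {"bike": 1.2, "van": 3.0, "dog": 2.0}
true = {"bike": 1.0, "van": 4.0, "dog": 5.0}   # dog should be behind the van
print(depth_ordering_accuracy(pred, true))     # 2 of 3 pairs correct
```

Two of the three pairs are ordered correctly (bike/van and bike/dog), but the dog is placed in front of the van instead of behind it, so the score is 2/3.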

The scoreboard (with context):

  • Depth ordering: SeeThrough3D is the best. Think of it like being the only student who consistently explains which friend stands in front in the class photo.
  • Objectness/layout: Highest for SeeThrough3D, meaning it puts the right things in the right spots with less confusion.
  • Orientation error: Much lower for SeeThrough3D thanks to color-coded faces—like having a compass; others guess and flip things 180°.
  • Text alignment: Strong for SeeThrough3D, showing it doesn’t lose track of the prompt while following layout.
  • KID (quality): SeeThrough3D achieves the lowest (best) score, like getting an A for neat handwriting while also acing the answers.

User study results: In A/B comparisons, people strongly preferred SeeThrough3D for realism, layout following, and prompt following versus each baseline. This mirrors the metrics: it looks better, obeys the plan better, and still matches the text well.

Surprising findings:

  • Overlap attention: Even when OSCR regions overlap and attend to multiple nouns, the model’s internal features stay separated enough to keep crisp edges. Attention maps visibly trace the occlusion boundaries—inside the bike’s empty frame you can still “see” the van’s attention, matching real-world visibility.
  • Generalization: Trained on synthetic scenes with limited categories and rigid poses, the method still handled out-of-domain items (like instruments) and natural poses (like sitting), likely because the base model’s strong prior is preserved by the light-touch training.
  • Camera control: Moving the camera (e.g., lowering it) keeps occlusions consistent, which 2D-only controls struggle with.

Ablations that teach lessons:

  • Without transparency: Losing see-through cues harms occlusion reasoning and depth ordering.
  • Without color coding: Orientation accuracy drops; faces need distinct colors to ground which side is front/left/right.
  • Without binding (attention mask): Objects drift or mix traits; binding is essential for multi-object correctness.
  • Without hard data (weak occlusions): The model learns less about overlaps; filtering for strong occlusions pays off.

05 Discussion & Limitations

Limitations (be specific):

  • Base-model ceiling: The approach inherits strengths and weaknesses of the underlying generator (FLUX/DiT). If FLUX struggles with rare cases (e.g., a parrot correctly behind cage bars), SeeThrough3D may also stumble.
  • Personalization memory: Multi-object personalization needs more VRAM since reference tokens must fit in the transformer’s context simultaneously.
  • Not a full 3D editor: It controls layout and viewpoint in 2D image generation, but it doesn’t output a full 3D asset or guarantee cross-view consistency across different layouts yet.
  • Data bias: Although the dataset is carefully built, it’s still synthetic and selected for strong occlusions; real-world clutter and long-tail categories can be harder.

Required resources:

  • A capable text-to-image transformer (e.g., FLUX) and GPU(s) for training/fine-tuning LoRA.
  • Blender (or similar) to procedurally render box layouts and views.
  • Optional: Depth-to-image augmentation and CLIP filtering pipeline to expand realism and keep layouts aligned.

When not to use:

  • If you need a true 3D scene you can orbit smoothly around (full 3D reconstruction), this won’t replace NeRFs or 3D Gaussians.
  • If your use-case is single-object, no-overlap posters, simpler 2D-guided tools may suffice.
  • If you require strict multi-view consistency across drastically different layouts with the same exact objects, this method isn’t designed for that yet.

Open questions:

  • Can we keep image identity stable under layout edits (like moving the same dog box to a new spot without changing its style)?
  • Could OSCR be extended beyond boxes to curved proxies or coarse meshes while staying simple and fast?
  • How far can attention masking scale with dozens of heavily overlapping objects?
  • Can personalization be made memory-light for many subjects at once?
  • What’s the best way to fuse real photos with synthetic renders for even broader generalization?

06 Conclusion & Future Work

Three-sentence summary: SeeThrough3D introduces OSCR—translucent, color-coded 3D boxes rendered from a camera—as a simple but powerful hint image that teaches text-to-image models how to handle occlusions, orientations, and viewpoint. A masked attention scheme binds each prompt noun to its corresponding OSCR region, stopping attribute mix-ups in busy, overlapping scenes. Trained on synthetic but realistic-looking data, the method outperforms prior approaches on layout, occlusion, and quality, and also supports camera control and personalization.

Main achievement: Turning a hard 3D reasoning problem (occlusion plus orientation plus camera) into a single rendered condition image that transformers can easily digest—then locking words to regions with a neat attention mask.

Future directions: Maintain the same object identity under layout edits; explore richer 3D proxies beyond boxes; cut memory for multi-subject personalization; and blend real-world data with synthetic layout scenes for stronger robustness. Also, extend to video so moving cameras and objects keep perfect occlusion over time.

Why remember this: It shows that a clear, compact 3D-aware picture (OSCR) and a tiny bit of smart attention control are enough to unlock realistic, occlusion-correct multi-object generation—precise, practical control for artists, designers, and anyone who wants scenes that look truly 3D.

Practical Applications

  • Interior design mockups where furniture overlaps realistically from any camera angle.
  • Game level concept art with crowds and props that properly occlude each other.
  • Product placement previews (e.g., shelf arrangements) with correct front/back ordering.
  • Film pre-visualization storyboards with precise camera control and staging.
  • Education demos that show how occlusion and perspective work in real scenes.
  • AR/VR scene planning with believable partial visibility before building assets.
  • E-commerce lifestyle images where multiple items overlap naturally.
  • Robotics simulation visuals where objects block each other correctly for sensor planning.
  • Brand-safe ads that personalize products (logos, patterns) in exactly placed 3D boxes.
  • Urban planning visuals with cars, people, and structures rendered with accurate depth.
#occlusion-aware generation #3D layout control #text-to-image #OSCR #masked self-attention #visual tokens #camera viewpoint control #flow-based diffusion transformer #LoRA conditioning #synthetic dataset #Blender rendering #depth ordering #orientation control #personalized generation #layout adherence