
BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

Beginner
Eliran Kachlon, Alexander Visheratin, Nimrod Sarid et al. Ā· 2/24/2026
arXiv

Key Summary

  • BBQ is a text-to-image model that lets you place objects exactly where you want using numeric bounding boxes and color them with exact RGB values.
  • Instead of changing the model’s architecture, BBQ simply learns from captions that include numbers for positions and colors.
  • A helper vision–language model (VLM) turns short prompts into a detailed JSON plan with boxes and colors, so users can just drag objects or pick colors.
  • BBQ keeps things disentangled: changing a box or a color only changes that part of the image and leaves the rest alone.
  • On a reconstruction test (TaBR), people preferred BBQ’s outputs over leading models like Flux.2 Pro, FIBO, and Nano Banana Pro.
  • For spatial accuracy, BBQ beats strong baselines and GLIGEN on COCO and LVIS, while being slightly behind the specialized InstanceDiffusion.
  • For color accuracy, BBQ best matches target hues and saturation (chroma) across tests, showing fewer big mistakes.
  • This approach creates a new workflow: user intent → structured numeric plan → the model renders it like a precise art engine.
  • The system supports easy, pro-style controls (drag to move, resize, color pickers) without tricky prompt wording.
  • BBQ suggests a future where image generation is programmable, precise, and friendly for professional design tasks.

Why This Research Matters

BBQ turns image generation into a precise, professional tool by letting people say exactly where things go and what colors they should be. That saves time for designers, marketers, and creators who can now drag objects to positions and pick exact brand colors instead of wrestling with vague prompts. It makes revisions easy: tweak numbers, regenerate, and only the intended parts change. This approach also broadens accessibility—simple interfaces like sliders, color pickers, and drag handles become the main way to control images. By avoiding complex architectural add-ons or slow inference tricks, BBQ keeps workflows fast and maintainable. The result is a smoother path from idea to production-quality visuals. It also lays the groundwork for even richer controls like poses, materials, and lighting in the future.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how when you ask a friend to draw you a picture, saying 'Put a red ball near the bottom-right,' they might guess where 'bottom-right' is and what shade of red you meant? Words can be fuzzy.

🄬 The Concept: Text-to-Image Models

  • What it is: A text-to-image model is a computer artist that paints pictures from your words.
  • How it works:
    1. Read your description
    2. Imagine what that should look like
    3. Turn that idea into a picture
  • Why it matters: Without this, you’d have to draw everything yourself. With it, you can create art and designs quickly just by describing them. 🍞 Anchor: Say 'A brown puppy playing in a park' and the model draws exactly that—no crayons needed.

🍞 Hook: Imagine giving a recipe to a chef. 'Make it delicious' isn’t helpful; you need steps and details.

🄬 The Concept: FIBO-style Structured Captions

  • What it is: A structured caption is a super-detailed recipe for an image, listing objects, attributes, relations, and style.
  • How it works:
    1. Break the scene into pieces (who, what, where, how it looks)
    2. Write them in a consistent, organized format (like JSON)
    3. Feed that to the model so it knows exactly what to draw
  • Why it matters: Without structure, the model might miss small but important details. 🍞 Anchor: 'A red car, shiny paint, parked left of a blue bike, sunny lighting' produces a scene that matches those specifics.

🍞 Hook: Think of a stage play. If you tell actors 'stand kind of over there,' the blocking gets messy.

🄬 The Concept: Bounding Boxes

  • What it is: A bounding box is a rectangle that says where an object should go and how big it should be in an image.
  • How it works:
    1. Pick the top-left corner
    2. Pick the bottom-right corner
    3. The box between them marks the object's spot and size
  • Why it matters: Without boxes, 'top-right' can mean different things; boxes give exact coordinates. 🍞 Anchor: 'Dog at (0.10, 0.50) to (0.30, 0.85)' tells the model precisely where to place the dog.
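The normalized-box convention above is easy to sketch in code; `box_to_pixels` below is an illustrative helper, not part of BBQ itself:

```python
# A normalized bounding box is (x1, y1, x2, y2), each a fraction of the
# image size: (x1, y1) is the top-left corner, (x2, y2) the bottom-right.

def box_to_pixels(box, width, height):
    """Convert a normalized (x1, y1, x2, y2) box to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))

# 'Dog at (0.10, 0.50) to (0.30, 0.85)' on a 1024x1024 canvas:
print(box_to_pixels((0.10, 0.50, 0.30, 0.85), 1024, 1024))  # (102, 512, 307, 870)
```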

šŸž Hook: If you’ve ever tried to get the perfect paint color for a bedroom, you know 'red' isn’t enough; you need the exact shade.

🄬 The Concept: RGB Color Control

  • What it is: RGB is a way to pick exact colors using three numbers: Red, Green, and Blue.
  • How it works:
    1. Choose how strong the red is (0–255)
    2. Choose green (0–255)
    3. Choose blue (0–255)
    4. The combination gives a precise color
  • Why it matters: Words like 'crimson' or 'maroon' are vague; RGB is exact. 🍞 Anchor: 'Shirt color: (220, 32, 167)' makes the shirt exactly that bright pinkish-purple.

🍞 Hook: Picture a sound mixer with separate sliders for volume, bass, and treble. Move one, the others stay put.

🄬 The Concept: Disentangled Control

  • What it is: Disentanglement means you can change one thing (like color) without messing up others (like position).
  • How it works:
    1. Divide the scene info into clear parts (layout, color, style)
    2. Let each part be adjusted on its own
    3. Recombine them to make the final image
  • Why it matters: Without it, changing one detail could accidentally change the whole picture. 🍞 Anchor: Move the dog’s box to the right; only the dog moves. Change the shirt’s RGB; only the shirt changes color.

The World Before: Early text-to-image models were great at artful surprises but not at following precise instructions. Then came long, structured captions (like FIBO), which boosted control using clear language. But language stayed a little fuzzy for pro tasks that need exact numbers.

The Problem: Professionals—designers, advertisers, filmmakers—need to say 'Put the logo exactly here: (x1, y1, x2, y2)' and 'Use exactly this brand color: (R, G, B).' Words alone can’t guarantee that level of precision.

Failed Attempts:

  • Special architectures (like adding extra position tokens or new modules) worked but added complexity and were hard to maintain.
  • Training-free tricks at generation time could nudge layouts, but they were delicate and sometimes slow.
  • Color control methods often needed extra adapters or special losses, or they still leaned on ambiguous color words.

The Gap: A simple, scalable way to put numbers directly into what the model reads—without changing the model’s guts or doing fiddly tricks during generation.

The Stakes: Real jobs depend on it—placing furniture in a room mockup, color-matching a brand logo, storyboarding a scene, or making a catalog image where items must sit in exact slots. If a tool can take precise boxes and colors, it becomes far more reliable for daily professional work.

02 Core Idea

šŸž Hook: Imagine building LEGO with a blueprint that has exact measurements for where every piece goes and the exact color for each brick—no guessing.

🄬 The Concept: The Aha! Moment

  • What it is: BBQ teaches a big text-to-image model to read numbers (boxes and RGB) inside structured text, so it can place objects and paint them with exact colors—no architecture changes needed.
  • How it works:
    1. Take long, structured captions and add numeric boxes and RGB values
    2. Train the existing model on tons of these enriched captions
    3. Use a helper model to turn short prompts into these detailed, numeric captions
    4. Let users edit numbers (drag boxes, pick colors) and regenerate with targeted changes
  • Why it matters: Without this, professionals juggle vague words and trial-and-error prompts; with it, they get pro-style, deterministic control. 🍞 Anchor: 'Place the mug at (0.55, 0.60)–(0.70, 0.85), color (34, 139, 34),' and BBQ reliably draws a forest-green mug exactly there.

Three Analogies:

  1. Recipe with exact measurements: '2.00 cups flour, 1.00 tsp salt' beats 'some flour, a pinch of salt.' Numeric boxes and RGB are exact measurements for images.
  2. Stage tape for actors: Mark X’s on the floor; actors hit their spots. Boxes are the tape marks for objects.
  3. Paint-by-numbers: Each region gets a precise color code; RGB makes sure the color is spot-on.

Before vs After:

  • Before: 'Put the cat near the bottom-right and make the shirt crimson' might come out close but not exact.
  • After: '(x1, y1, x2, y2) for the cat; RGB for the shirt' consistently lands the cat in the right place and nails the color.

šŸž Hook: Think of reading a map where every street has a name and every house has a number—easy to navigate.

🄬 The Concept: Parametric Structured Prompts

  • What it is: A structured 'mini-language' (like JSON) that includes words plus numbers for boxes and colors.
  • How it works:
    1. Split the scene into objects with attributes
    2. Give each object a box and, if needed, an RGB color
    3. Keep everything in a tidy, consistent format the model can learn
  • Why it matters: Without a clear language for numbers, models can’t take precise instructions. 🍞 Anchor: The prompt includes: 'woman: box: (0.20, 0.25, 0.40, 0.80), shirt_rgb: (220, 32, 167).'
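A parametric plan in this spirit might look like the sketch below. The exact schema (field names, nesting) is an assumption for illustration, not the paper's published format:

```python
import json

# Hypothetical parametric prompt: objects carry a normalized box and,
# where needed, an exact RGB attribute. The serialized JSON string is
# what the generator would read as its caption.
plan = {
    "scene": "a woman walking in a park",
    "objects": [
        {
            "name": "woman",
            "box": [0.20, 0.25, 0.40, 0.80],      # normalized (x1, y1, x2, y2)
            "attributes": {"shirt_rgb": [220, 32, 167]},
        }
    ],
}

prompt_text = json.dumps(plan)
print(prompt_text)
```

Because the plan is plain data, a UI can edit one field (a box corner, one RGB triple) and re-serialize without touching anything else.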

šŸž Hook: You know how a great translator turns a quick idea into a perfect plan?

🄬 The Concept: Vision–Language Model (VLM) as a Bridge

  • What it is: A VLM is a helper that expands your short idea into the full parametric plan the generator needs.
  • How it works:
    1. Read a brief prompt ('two people walking a dog')
    2. Propose a reasonable layout with boxes and colors
    3. Or edit an existing plan when you say 'move the dog right' or 'make the jacket blue'
  • Why it matters: Without the bridge, writing precise JSON by hand is tedious. 🍞 Anchor: Type 'A kid and a parent with a kite' → the VLM outputs boxes and colors → BBQ renders a coherent scene.

🍞 Hook: Think of a printer that perfectly follows a blueprint.

🄬 The Concept: Flow-based Transformer as Renderer

  • What it is: A flow-based transformer is a model that learns how to turn a plan into an image smoothly and reliably.
  • How it works:
    1. Take the structured plan as input tokens
    2. Follow learned 'flow' steps that guide pixels from noise to a clear image
    3. Produce the final picture aligned with the plan
  • Why it matters: Without a strong renderer, even the best plan wouldn’t become a faithful image. 🍞 Anchor: Feed the JSON plan in, and out comes the image with objects in the right boxes and with the right colors.

Why It Works (intuition):

  • Transformers are great readers. If you teach them that numbers in the caption mean real positions and colors, they learn to respect them.
  • Lots of examples (25 million!) help the model discover tight links between numeric tokens and visual results.
  • Structured prompts keep attributes separate, so moving a box doesn’t blur the color, and changing a color doesn’t scramble the layout.

Building Blocks:

  • Data Enrichment: Start from structured captions and add boxes and RGB from reliable tools, so numbers always reflect what’s in the image.
  • BBQ Training: Keep the architecture, feed it numeric-augmented captions at scale, and let it learn precise control.
  • Parametric Bridge (VLM): A helper that generates, refines, and inspires (extracts) parametric prompts, so humans don’t have to write JSON.
  • Interactive Edits: Change the numbers (drag boxes, pick colors), reuse the same seed, and see only the intended parts change.

03 Methodology

At a high level: Short Prompt → VLM expands to Parametric JSON (with boxes and RGB) → BBQ reads JSON and renders Image → User edits numbers → BBQ updates the image while keeping other parts stable.

Step 1: Build the Parametric Training Data

  • What happens: Each training image gets a long, structured caption that is then enriched with numeric bounding boxes and RGB values for objects and a global palette.
  • Why this step exists: The model must see many examples where numbers in text match what appears in the picture, or it won’t learn precise control.
  • Example: An image of 'a woman in a pink jacket holding a yellow umbrella' becomes a JSON with 'woman.box: (0.22, 0.18, 0.46, 0.90)' and 'jacket_rgb: (220, 32, 167)', 'umbrella_rgb: (230, 210, 50)'.

Step 2: Train the BBQ Generator (No Architecture Changes)

  • What happens: Start from a strong text-to-image backbone that already understands structured captions. Continue training it on 25 million caption+image pairs where the captions now include boxes and RGB.
  • Why this step exists: We want the model to treat numbers as real instructions. Scale and consistent formatting make that link stick.
  • Example: After enough examples, 'cat.box: (0.15, 0.40, 0.35, 0.80)' reliably places the cat there; 'shirt_rgb: (34, 139, 34)' reliably paints the shirt forest green.

Step 3: Add the Parametric Bridge (VLM)

  • What happens: Fine-tune a compact VLM to convert short prompts or edit instructions into complete JSON with boxes and RGB, and to extract JSON from a reference image when needed.
  • Why this step exists: Writing exact coordinates and RGB codes is hard and slow for humans; the VLM handles the heavy lifting.
  • Modes:
    1. Generate: Make a full JSON from a short prompt
    2. Refine: Edit an existing JSON based on your instruction ('move the dog right by 10%') while keeping the plan coherent
    3. Inspire: Read a reference image and output its parametric JSON as a starting template
  • Example: 'Three bottles in a row, red, green, blue' becomes three boxes with evenly spaced coordinates and the correct RGB codes.

Step 4: Interactive Editing and Re-generation

  • What happens: Users drag a box, adjust its size, or pick a new color, then regenerate with the same seed so that only the targeted elements change.
  • Why this step exists: Professionals need predictable, local edits without redoing the whole scene.
  • Example: Swap the boxes of 'man' and 'woman' in the JSON; the people switch places while the background and lighting stay constant.
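The edit-and-regenerate loop can be sketched as below. `swap_boxes` and the commented-out `generate` call are hypothetical stand-ins, not BBQ's actual API; the key idea is that reusing the same seed keeps everything except the edited numbers stable:

```python
import copy
import json

def swap_boxes(plan, name_a, name_b):
    """Return a new plan with the boxes of two named objects exchanged."""
    plan = copy.deepcopy(plan)  # leave the original plan untouched
    objs = {o["name"]: o for o in plan["objects"]}
    objs[name_a]["box"], objs[name_b]["box"] = objs[name_b]["box"], objs[name_a]["box"]
    return plan

plan = {"objects": [{"name": "man",   "box": [0.1, 0.2, 0.4, 0.9]},
                    {"name": "woman", "box": [0.6, 0.2, 0.9, 0.9]}]}

edited = swap_boxes(plan, "man", "woman")
# image = generate(json.dumps(edited), seed=42)  # same seed as the first render
```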

šŸž Hook: You know how following a well-written recipe keeps the cake tasty even if you change only the frosting color?

🄬 The Concept: Secret Sauce — Do It With Data, Not Hardware

  • What it is: The clever part is that BBQ doesn’t add new modules or special inference tricks. It just learns from better (numeric) captions.
  • How it works:
    1. Use a structured format to keep attributes separated
    2. Insert exact numbers (boxes, RGB) directly into the text the model reads
    3. Train at scale so the model internalizes the numeric-to-visual mapping
  • Why it matters: Without extra modules to maintain or slowdowns at generation time, the system is simpler and more robust. 🍞 Anchor: It’s like teaching the same chef a recipe with precise measurements—no need for a new oven.

Keeping Scenes Coherent During Edits

  • What happens: When you move boxes in ways that would change the story (like separating hugging people), the VLM updates the textual part of the plan to keep things realistic (e.g., adjust pose or relation).
  • Why this step exists: Numbers alone can break the logic of a scene; the structured plan must stay consistent.
  • Example: If two dancers’ boxes move apart, the caption changes from 'hugging' to 'standing side-by-side', and BBQ draws a plausible update.

How BBQ Reads Numbers

  • What happens: Coordinates (normalized 0–1) and RGB (0–255) are placed in the JSON so the model tokenizes them like words.
  • Why this step exists: Transformers excel at learning patterns in sequences; once numbers are in the sequence, they can be learned like any other token.
  • Example: The model learns that '(0.70, 0.20)' tends to show objects in the upper-right, and '(255, 0, 0)' is bright red.
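Serializing those numbers into the caption text might look like the sketch below; the fixed two-decimal and integer formats are assumptions about the serialization, not confirmed details of BBQ:

```python
# Numbers become ordinary characters in the caption string, so the
# tokenizer treats them like any other text the model learns to read.

def fmt_box(box):
    """Format a normalized box with a consistent two-decimal convention."""
    return "(" + ", ".join(f"{v:.2f}" for v in box) + ")"

def fmt_rgb(rgb):
    """Format an RGB triple as plain integers in 0-255."""
    return "(" + ", ".join(str(int(v)) for v in rgb) + ")"

caption = (f"cat: box: {fmt_box((0.15, 0.40, 0.35, 0.80))}, "
           f"shirt_rgb: {fmt_rgb((34, 139, 34))}")
print(caption)
```

A consistent format matters: if every training caption writes boxes the same way, the model sees a stable numeric "mini-language" rather than many spellings of the same value.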

Reliability Through Scale and Structure

  • What happens: Many diverse examples and consistent formatting prevent overfitting to specific layouts and help the model generalize to new scenes.
  • Why this step exists: To make sure the model can follow new numeric instructions it has never seen before.
  • Example: Even if the model never saw 'teal teapot at (0.12, 0.35, 0.25, 0.55),' it can still place it and color it correctly because it understands the numeric language.

04 Experiments & Results

šŸž Hook: Imagine testing a GPS. You check: Does it find the right place? Does it get the route right? And does it work better than other GPS apps?

🄬 The Concept: How BBQ Was Tested

  • What it is: The team measured (1) overall faithfulness to complex scenes, (2) how well objects land inside their requested boxes, and (3) how close colors match target RGBs.
  • How it works:
    1. Text-as-a-Bottleneck (TaBR): Recreate real images from detailed captions and ask people which recreation is closer to the original
    2. Bounding-box accuracy: Generate images from prompts with numeric boxes and check alignment using trained detectors
    3. Color accuracy: Ask for exact RGB colors on single objects and measure how close the result is to the target
  • Why it matters: Without solid tests, you can’t tell if numeric control truly works. 🍞 Anchor: If BBQ consistently places a 'blue mug' exactly where asked and with the right blue, that’s a clear win.

🍞 Hook: Taking good notes from a long class helps you retell it accurately.

🄬 The Concept: TaBR (Text-as-a-Bottleneck Reconstruction)

  • What it is: A test where a VLM produces a detailed caption of a real image; models then rebuild the image from that text, and judges pick which looks more like the original.
  • How it works:
    1. Start with a real image
    2. Create a structured caption describing it
    3. Competing models render from that caption
    4. Humans vote on which is closer to the original
  • Why it matters: It measures a model’s overall expressiveness and alignment to detailed instructions. 🍞 Anchor: BBQ’s reconstructions kept layouts and details so well that it won most head-to-head comparisons, including a 93.3% win rate versus Flux.2 Pro among decisive decisions.

Results (TaBR):

  • Against Nano Banana Pro: BBQ wins 65.2% of decisive comparisons.
  • Against FIBO: BBQ wins 76.1%.
  • Against Flux.2 Pro: BBQ wins 93.3%.

Interpretation: Think of this like getting an A when others get B’s; BBQ more often recreates the original’s composition and details.

šŸž Hook: If you tape X marks on the floor for dancers, you want them to end up exactly on the tape.

🄬 The Concept: Bounding-Box Accuracy

  • What it is: A box-following test that checks whether generated objects appear inside the requested numeric boxes.
  • How it works:
    1. Generate images with given boxes
    2. Run an object detector (like YOLO) to find objects in the image
    3. Compare detected boxes to the requested boxes
  • Why it matters: It tells you if the model can truly follow spatial instructions. 🍞 Anchor: On COCO and LVIS datasets, BBQ outperforms strong general models and GLIGEN, while slightly trailing the specialized InstanceDiffusion.
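The detector-versus-request comparison boils down to intersection-over-union (IoU) between each requested box and the detected one. A minimal version:

```python
# IoU: area of overlap divided by area of union, for two axis-aligned
# (x1, y1, x2, y2) boxes in the same coordinate system. 1.0 = perfect
# agreement, 0.0 = no overlap at all.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.5, 0.5)))  # identical boxes -> 1.0
print(iou((0.0, 0.0, 0.2, 0.2), (0.5, 0.5, 0.9, 0.9)))  # disjoint boxes -> 0.0
```

Detection metrics such as COCO AP are built on thresholds over exactly this quantity (e.g., counting a placement correct when IoU exceeds 0.5).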

Scoreboard (COCO/LVIS):

  • BBQ’s box alignment (for example, 28.6 COCO AP) surpasses Nano Banana Pro, Flux.2 Pro, and GLIGEN.
  • InstanceDiffusion remains best at strict box-following but is a specialized system; BBQ reaches strong accuracy without special architecture or slow sampling tricks.

šŸž Hook: When matching paint colors at the store, you compare your sample card to the paint mixed by the machine.

🄬 The Concept: Color Fidelity with CIEDE2000 and Chroma Distance

  • What it is: Two ways to measure how close a generated color is to the requested one—one focuses on human-perceived difference (CIEDE2000), the other on hue+saturation (a–b distance) regardless of lightness.
  • How it works:
    1. Generate single-object images on white
    2. Segment the object and cluster its colors (K-means)
    3. Pick the cluster closest to the target color
    4. Compute distances (lower is better)
  • Why it matters: We want the right color shade, not just roughly similar or lighter/darker. 🍞 Anchor: BBQ achieved the lowest chroma (a–b) errors across tests, meaning it nailed the actual hue and saturation more often and avoided big misses.
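The a–b (chroma) distance can be sketched as follows: convert sRGB to CIELAB via the standard D65 transform, then measure distance in the a–b plane only, ignoring lightness L. This is the textbook conversion, not the paper's exact evaluation code, and the full CIEDE2000 formula (which also reweights hue and chroma differences) is omitted here:

```python
import math

def srgb_to_lab(rgb):
    """Convert an (R, G, B) tuple in 0-255 to CIELAB (L, a, b), D65 white point."""
    def lin(c):  # undo sRGB gamma
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    # linear sRGB -> CIE XYZ (D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    def f(t):  # CIELAB companding
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def ab_distance(rgb1, rgb2):
    """Hue+saturation mismatch: Euclidean distance in the a-b plane only."""
    _, a1, b1 = srgb_to_lab(rgb1)
    _, a2, b2 = srgb_to_lab(rgb2)
    return math.hypot(a1 - a2, b1 - b2)

# A darker version of the same pinkish hue scores far better than a green:
print(ab_distance((220, 32, 167), (180, 20, 135)))
print(ab_distance((220, 32, 167), (34, 139, 34)))
```

Dropping L is the point: a generated object that is the right hue but sits in shadow still scores well on a–b distance, which matches the paper's claim that BBQ preserves realistic lighting while keeping the color's core identity.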

Scoreboard (examples):

  • a–b mean error: BBQ ~7.16 vs 9–11 for baselines (lower is better), with similarly strong medians and fewer large errors (p90).
  • CIEDE2000: BBQ is competitive; some baselines score slightly better by making lighting more uniform. BBQ preserves realistic lighting while still matching the color’s core identity well.

Surprises and Insights:

  • Big one: BBQ didn’t need new modules or inference hacks to follow numbers, just better, numeric-enriched training data.
  • BBQ is a generalist that still competes with layout specialists—impressive given its simplicity and speed at inference.
  • Color is hard under complex lighting; BBQ’s strong chroma scores show it respects the requested shade even when the scene isn’t flat-lit.

05 Discussion & Limitations

Limitations:

  • Box/Color Input Quality: If the numeric boxes or target RGBs are wrong or inconsistent with the rest of the plan, results can look odd.
  • Extreme Edits: Stretching or separating boxes in ways that change the story (e.g., pulling hugging people apart) requires the VLM to rewrite relationships; it may sometimes over- or under-correct.
  • Specialized Layout Gap: InstanceDiffusion still leads in pure box-following; BBQ trades a bit of that for generality and simplicity.
  • Lighting vs Color: Matching exact chroma under dramatic lighting is tough; sometimes lightness differs even when hue is right.
  • Bridge Reliability: The VLM bridge can occasionally output implausible or clashing layouts without enough guardrails.

Required Resources:

  • A strong backbone model (≈8B parameters) trained on ~25M image+caption pairs with parametric annotations.
  • Data tools for segmentation, detection, depth, and color palette extraction to enrich captions at scale.
  • GPUs for training and a modest VLM fine-tune for the bridge.

When NOT to Use:

  • Purely free-form art where exact positions/colors don’t matter and you want maximum surprise.
  • Ultra-constrained technical drawings needing micron-level accuracy.
  • Situations with severe motion or occlusion where boxes are hard to define meaningfully.
  • Workflows that require non-RGB color spaces (like Pantone or spectral colors) not yet encoded in the schema.

Open Questions:

  • Beyond Boxes and RGB: Can we add poses, materials, lighting rigs, or physics as numeric parameters in the same language?
  • Robust Coherence: How can the bridge better rewrite scene relationships after large box edits—while staying faithful to the user’s intent?
  • Learning From Edits: Can the system self-improve from user drag-and-drop sessions to refine future layouts?
  • Multi-Object Interactions: How to keep consistency when many overlapping boxes and exact color requests interact?
  • Color Under Real Lighting: Could we model local illumination explicitly so both chroma and lightness match under complex shading?

06 Conclusion & Future Work

Three-Sentence Summary: BBQ lets you tell a big image model exactly where to put objects (with boxes) and exactly what colors to use (with RGB) by teaching the model to read numbers inside a structured prompt. It achieves precise spatial and color control without changing the architecture or using slow, tricky inference steps—just better, numeric-enriched training data plus a VLM bridge. The results show stronger box alignment and color fidelity than leading general-purpose models, while keeping edits nicely disentangled.

Main Achievement: Turning numeric parameters into first-class citizens inside a structured text prompt—and proving at scale that a general text-to-image transformer can natively follow those numbers.

Future Directions: Extend the schema to include poses, materials, lighting specs, and even temporal controls for video; improve the bridge’s reasoning to maintain scene coherence after large edits; and explore color spaces beyond RGB for print and product design. Also, investigate uncertainty-aware interfaces that suggest safe edits when requested changes might break the scene.

Why Remember This: BBQ marks a shift from descriptive prompting to programmable image making—where you don’t just hope the model understands, you tell it exactly what to do with simple, familiar tools like dragging and color pickers. That unlocks pro-grade reliability while staying user-friendly, setting the stage for truly controllable, production-ready generative systems.

Practical Applications

  • Brand-safe ad creation: Place logos in exact box locations and match official RGB colors.
  • E-commerce catalogs: Arrange products into grid slots precisely and keep consistent colorways.
  • Storyboards and pre-visualization: Block character positions with boxes and set costume colors.
  • UI/UX mockups: Position icons and components at exact coordinates with consistent palettes.
  • Fashion design previews: Recolor garments to exact RGB values and place models in set positions.
  • Interior design mockups: Place furniture into specific areas of a room and test fabric colors.
  • Educational graphics: Precisely position labeled objects in diagrams with uniform color codes.
  • Comics and manga layout: Keep characters in panel-specific boxes and apply stable color schemes.
  • A/B testing visuals: Shift one object’s position or hue numerically to test audience response.
  • Template-based content: Reuse a parametric JSON to generate many variations with controlled edits.
Tags: text-to-image, bounding boxes, RGB control, structured captions, disentangled control, flow-based transformer, vision–language model, image generation, color fidelity, COCO, LVIS, TaBR, CIEDE2000, parametric prompts