From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
Key Summary
- The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.
- It builds a new video dataset, PhysicTran38K, that shows how things actually change over time (like melting, bending, and refraction).
- The method, PhysicEdit, thinks in two ways at once: with words (physics rules written out) and with vision (hidden motion hints learned from videos).
- Hidden “transition queries” act like tiny helpers that remember how scenes typically change, so the editor can stay physically correct.
- Two visual teachers guide those helpers: DINOv2 for big shapes and layouts, and a VAE for small textures and details.
- Guidance changes over time in the diffusion process: early steps favor structure; later steps polish textures.
- On tough tests (PICABench and KRISBench), PhysicEdit beats strong open-source models and rivals some closed models, especially on physics-heavy tasks.
- It fixes classic mistakes like drawing a straight straw in water (ignoring refraction) or moving shadows the wrong way.
- The approach improves realism without hurting everyday edits like adding or replacing objects.
- This matters for AR, e-commerce, education, design, and safety visuals where believable physics is essential.
Why This Research Matters
In everyday photos and designs, people instantly sense when physics is wrong, even if they can’t explain why. PhysicEdit reduces those “that looks fake” moments by teaching the model how changes actually unfold over time. This makes AR overlays more believable, product edits more trustworthy for shoppers, and educational visuals better at conveying real cause-and-effect. Designers and engineers can iterate faster with previews that honor material behavior and light transport. Safer training materials can illustrate correct deformation, cooling, and lighting cues. In short, edits become both faithful to instructions and faithful to the world.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a comic book shows you the first panel and the last panel, but the cool part is watching what happens in between? If you only saw the first and last pictures, you might guess wrong about how the story actually unfolded.
🥬 The Concept (Physics Awareness): What it is: Physics awareness means an AI editor understands how the real world behaves—like gravity pulling things down, light bending in water, or soft stuff squishing. How it works:
- Notice the physical setup (materials, shapes, contacts).
- Apply the right rule (e.g., gravity, refraction, melting).
- Predict a believable way the scene changes step by step. Why it matters: Without physics awareness, edits look right at first glance but feel off—like a straight straw through water or a shadow that moves the wrong way. 🍞 Anchor: When someone says “insert a straw into the glass,” a physics-aware model draws the straw bent at the waterline (refraction) instead of straight.
The World Before: For years, instruction-based image editors got great at following directions like “make the sky pink” or “add a tree.” They used powerful diffusion models and multimodal reasoning to match the words to the picture. But they mostly treated editing as flipping a switch from one image to another. That worked for style changes or simple swaps, but stumbled when the instruction implied real-world cause and effect—like bending, freezing, shattering, or light refraction. Models often prioritized matching nouns and adjectives (“add straw,” “turn off lamp”) over obeying physical rules (refraction, shadow propagation, heat transfer).
🍞 Hook: Imagine telling a friend “drop the ball” and they instantly hand you a picture of the ball on the floor. You’d ask, “But how did it fall? Where’s the blur or bounce?”
🥬 The Concept (Static vs. Dynamic Editing): What it is: Traditional editors map a starting image and a text instruction straight to a final image, skipping how things change along the way. How it works:
- Read the instruction.
- Change pixels to match the request.
- Output the final image. Why it matters: Skipping the “in-between” means the edit can ignore laws like gravity, friction, or optics, creating results that look okay but feel fake. 🍞 Anchor: “Let the ball fall” might produce a ball lower in the frame with no motion blur, wrong bounce, or mismatched shadow.
The Problem: Physical realism requires the path, not just the endpoints. A straw doesn’t just ‘teleport’ into proper shape; light doesn’t just instantly dim everywhere at once; a soft object doesn’t become compressed without showing squash, stretch, and contact. Paired image datasets teach only “before vs. after.” That’s like giving a student the first and last step of a science experiment without showing the procedure. The result: high semantic alignment (“got the right object”) but low physical plausibility (“broke the laws”).
Failed Attempts:
- More instruction tuning: better at following the text, but still fuzzy on real physics.
- Attention hacks or latent inversions: great for style or small edits, but brittle for big structural changes.
- Explicit frame rollouts (generate intermediate frames during inference): helps with reasoning but can be slow, compute-heavy, and error-prone over time.
🍞 Hook: Think of baking cookies. If you only check the raw dough and the final cookie, you might bake too hot or too fast. Watching the middle steps (dough spreading, browning) teaches you the real process.
🥬 The Concept (Physical State Transition): What it is: Treat editing as predicting a scene’s journey from start to finish under real laws—like a tiny simulation. How it works:
- See the initial state (the source image).
- Read the trigger (the instruction, e.g., “heat,” “push,” “insert”).
- Follow the right physical law (e.g., melting, refraction, deformation).
- Evolve the scene step by step to a realistic final state. Why it matters: If you learn the path, the destination becomes believable. 🍞 Anchor: “Freeze the soda can” should show frost forming, condensation freezing, and a duller surface—not just a random icy sticker.
The Gap: We needed training that shows the in-between. Videos are perfect because they reveal motion, deformation, and optical changes over time. But we also needed a way to use those videos during training while still producing a single edited image at test time, fast and reliably.
Real Stakes:
- Everyday trust: Product photos, real estate, or news-like visuals must look plausible, or people get misled.
- AR and education: Overlaying physics-consistent visuals helps students and users learn real cause-and-effect.
- Design and prototyping: Engineers and artists want edits that honor material properties and light behavior.
- Safety: Training materials should visualize correct shadowing, heat, and breakage patterns to prevent false intuitions.
This paper fills that gap with two big moves: (1) a new dataset, PhysicTran38K, that organizes and verifies video examples of real physical transitions, and (2) a new editor, PhysicEdit, that learns hidden “transition priors” from video, plus writes down physics hints in text, so it can produce a single, physically faithful edited image—no slow video rollout needed.
02 Core Idea
🍞 Hook: Imagine a flipbook. If you only keep the first and last page, your drawing might miss the arc of a jump or the bend of a twig in the wind. The magic lives in the pages between.
🥬 The Concept (Key Insight): What it is: The big idea is to learn how scenes change over time (from videos) and pack that knowledge into tiny helpers (transition queries) plus a physics “notes” page (reasoning), so a single edited image comes out physically correct. How it works:
- Train with videos that show real transitions (melting, refraction, bending).
- Distill those dynamics into compact latent transition priors (transition queries).
- Use a language-vision model to write physics-aware guidance text.
- Guide a diffusion editor step by step—structure first, texture later. Why it matters: It bridges the gap between “the right words” and “the right physics,” so edits feel natural and believable. 🍞 Anchor: When told “insert a straw into water,” the model bends the straw at the waterline and adjusts the magnification below the surface.
Three Analogies:
- GPS directions: Old editing is like teleporting to a destination; new editing is like following turn-by-turn directions that avoid impossible roads. The directions are the state transition.
- Recipe timing: Not just ingredients (objects) but the order and timing (heating then browning). The transitions capture when and how changes happen.
- Coach + muscle memory: The text branch is the coach reminding rules; the transition queries are muscle memory learned from practice videos.
🍞 Hook: You know how you first set up the Lego frame (structure) and later snap on the smooth tiles (details)?
🥬 The Concept (Diffusion Models, coarse-to-fine): What it is: Diffusion models build images in stages: big shapes come first, small details later. How it works:
- Start noisy.
- Early steps focus on global structure.
- Later steps refine textures and fine patterns. Why it matters: If we match the right kind of guidance to each stage, we get stronger physics and better details. 🍞 Anchor: When turning off a lamp, early steps set the room’s overall darkness and shadow directions; later steps clean up soft penumbras and reflective surfaces.
Before vs. After:
- Before: Editors emphasized matching the instruction and keeping content, but physics could be off (e.g., straight straws in water, wrong shadow travel, rigid objects bending impossibly).
- After: Editors still follow instructions and preserve content, but now they also respect how things should realistically change, thanks to learned transition priors and explicit physics reasoning.
Why It Works (intuition): The video-derived priors tell the model which motion/deformation/optical paths are likely. The text reasoning branch anchors the logic (“refraction at boundary,” “cooling reduces gloss, adds frost”). The diffusion process itself is a natural stage-by-stage scaffold for mixing structure-first (DINO) and texture-later (VAE) guidance.
Building Blocks (each as a sandwich):
- 🍞 Hook: Think of a science museum arranged by themes—mechanics, heat, light, materials, life. 🥬 The Concept (PhysicTran38K): What it is: A 38K-clip video dataset grouped by real physics transitions across five domains. How it works:
- Define physics categories and transitions (e.g., refraction, freezing, bending).
- Generate and filter videos; verify against physics principles.
- Write constraint-aware descriptions and pick intermediate keyframes. Why it matters: It teaches the model not just endpoints but the lawful paths between them. 🍞 Anchor: The dataset includes many “insert straw into water” and “turn off lamp” styles of clips with correct in-betweens.
- 🍞 Hook: Like a lab notebook that says, “Light bends at boundaries; soft things squish under force.” 🥬 The Concept (Physically-Grounded Reasoning): What it is: A frozen language-vision model writes down physics hints as text guidance. How it works:
- Read the image and instruction.
- List applicable laws and expected changes.
- Provide structured constraints to steer generation. Why it matters: It prevents logical mistakes (e.g., impossible light paths) even before drawing. 🍞 Anchor: For “dim the lamp,” it reminds the model that shadows broaden and reflections weaken, not just overall darkness.
- 🍞 Hook: Picture carrying a pocket guide of how things usually move or change. 🥬 The Concept (Latent Transition Priors via Transition Queries): What it is: Tiny learnable tokens that store common change patterns learned from video. How it works:
- Add K learnable queries to the condition sequence.
- During training, align them to features from intermediate frames.
- At test time, they predict the missing in-betweens from the source and instruction. Why it matters: They let the model “feel” the dynamics without generating a whole video. 🍞 Anchor: The queries help nudge the straw bend and the light distortion correctly when inserted into water.
- 🍞 Hook: Use a blueprint first, then paint. 🥬 The Concept (DINOv2 structural features): What it is: A teacher that captures big shapes and arrangements. How it works:
- Encode keyframes with DINOv2.
- Compress into a fixed-length structure target.
- Align queries to match these structure cues. Why it matters: Early guidance keeps geometry and global changes physically coherent. 🍞 Anchor: Ensures a collapsing structure folds believably instead of teleporting parts.
- 🍞 Hook: After building the frame, polish the surfaces. 🥬 The Concept (VAE texture features): What it is: A teacher that captures fine appearance (gloss, frost, grain). How it works:
- Encode keyframes with the editor’s VAE.
- Compress into a fixed-length texture target.
- Align queries to capture material-level changes. Why it matters: Late guidance refines real material effects (frosting, browning, shimmer). 🍞 Anchor: Freezing a can adds tiny frost crystals and dulls reflections instead of just tinting it blue.
- 🍞 Hook: Turn the big steering wheel first, then use the fine knobs. 🥬 The Concept (Timestep-Aware Modulation): What it is: Blend structure guidance early and texture guidance late during diffusion. How it works:
- Compute a time-weighted mix of structure and texture features.
- Feed it to the diffusion backbone at each step.
- Smoothly shift from global to local cues. Why it matters: Matches the natural coarse-to-fine process, avoiding abrupt, unrealistic jumps. 🍞 Anchor: When “turning off the lamp,” early steps set overall lighting logic; later steps tune soft glows and reflections on metal.
Together, these parts let PhysicEdit keep the speed and simplicity of single-image editing while inheriting the realism of video-taught physics.
03 Methodology
High-level Recipe: Input (source image + instruction) → Textual physics reasoning (frozen MLLM) → Add K transition queries → Diffusion editing with timestep-aware guidance → Output physically plausible edited image.
Step 0: Data that Teaches the In-Betweens (PhysicTran38K) 🍞 Hook: Imagine curating a giant flipbook library of real changes—melting, bending, refracting—so your brain learns the usual moves. 🥬 The Concept (Physics-Driven Data Construction): What it is: A pipeline that builds video examples labeled by physics transitions, checks them with simple principles, and writes grounded descriptions. How it works:
- Define a hierarchy: Mechanics, Thermal, Optical, Material, Biological, with 46 transition types.
- Generate candidate videos (fixed camera) and filter out shaky viewpoints.
- Propose basic physics principles per clip (e.g., “angle in = angle out” for reflection) and verify them on keyframes.
- Keep videos that pass; record contradicted rules as negatives.
- Extract (start, middle keyframes, end) and write constraint-aware reasoning. Why it matters: The model now sees how states evolve, not just before/after snapshots. 🍞 Anchor: For “mirror rotation moves a light spot,” the pipeline keeps clips where the spot shifts obeying reflection and discards ones with impossible jumps.
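The verification step above can be sketched as a simple rule check on measurements extracted from keyframes. This is a hypothetical illustration, not the paper's actual pipeline code: the clip dictionaries, field names, and the 5-degree tolerance are all assumptions, standing in for whatever measurement and thresholding the real filter performs.

```python
def verify_reflection(incident_deg, reflected_deg, tol_deg=5.0):
    """Check the law of reflection (angle in == angle out) within a tolerance.

    Angles are measured from the surface normal, e.g. estimated from a
    keyframe by tracking the light spot and the mirror orientation.
    """
    return abs(incident_deg - reflected_deg) <= tol_deg

def filter_clips(clips, tol_deg=5.0):
    """Keep clips that satisfy the principle; record violators as negatives."""
    kept, negatives = [], []
    for clip in clips:
        if verify_reflection(clip["incident_deg"], clip["reflected_deg"], tol_deg):
            kept.append(clip)
        else:
            negatives.append(clip)
    return kept, negatives
```

A clip whose light spot jumps to an impossible angle would land in `negatives`, matching the pipeline's use of contradicted rules as negative examples.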
Step 1: Read and Write Down the Physics First (Physically-Grounded Reasoning) 🍞 Hook: Like writing a checklist before building: “Light bends at water; shadows lengthen when the light dims.” 🥬 The Concept: What it is: A frozen multimodal LLM (Qwen2.5-VL-7B) produces a short physics-aware plan from the image and instruction. How it works:
- Look at the source image.
- Read the instruction.
- Produce a structured text trace: relevant laws, expected causal unfolding, material behavior. Why it matters: Without this step, the edit might satisfy the words but break the laws, e.g., moving a shadow in the wrong direction. 🍞 Anchor: “Turn off the lamp” leads to reasoning mentioning global illumination decrease, softer reflections, and shadow growth away from the light.
Step 2: Pack the Motion Sense into Tiny Helpers (Transition Queries) 🍞 Hook: Think of carrying small cue cards that remind you how things usually change—bend here, frost there, dim that. 🥬 The Concept (Implicit Visual Thinking via Transition Queries): What it is: K small learnable tokens appended after the reasoning text that store learned transition priors. How it works:
- Append K queries to the condition sequence.
- During training, encode intermediate keyframes with two encoders: DINOv2 (structure) and VAE (texture).
- Compress these into target features and align the queries’ projections to them.
- At test time, queries predict features from the source + instruction + reasoning (no video needed). Why it matters: They act like a memory of lawful changes, steering the editor away from impossible states. 🍞 Anchor: For “insert straw,” queries bias features toward a bent straw at the boundary and correct underwater magnification.
Step 3: Two Visual Teachers for Two Kinds of Guidance 3a) Structure Teacher: DINOv2 🍞 Hook: Build the skeleton first. 🥬 The Concept: What it is: DINOv2 provides strong layout and geometry signals. How it works:
- Encode sampled middle frames.
- Compress to a fixed-length structure vector.
- Train queries to match this vector via a projection head. Why it matters: Keeps big shapes evolving plausibly (e.g., folds, rotations, shadow placement). 🍞 Anchor: Rotating a mirror moves the light spot smoothly along the wall instead of teleporting.
3b) Texture Teacher: VAE 🍞 Hook: Polish the skin later. 🥬 The Concept: What it is: The VAE provides fine appearance cues. How it works:
- Encode the same middle frames.
- Compress to a fixed-length texture vector.
- Train queries to match via a second projection head. Why it matters: Captures material changes like frosting, browning, gloss shifts. 🍞 Anchor: Freezing brings subtle frost crystals and duller highlights, not just a white spray.
Step 4: Time the Guidance to the Diffusion Steps (Timestep-Aware Modulation) 🍞 Hook: Turn the big wheel first, then fine knobs. 🥬 The Concept: What it is: Mix structure guidance more at high noise (early) and texture guidance more at low noise (late). How it works:
- Compute ran(t) = t * structure + (1 - t) * texture during training.
- At test time, use the queries’ predicted versions.
- Feed this blended guidance into the diffusion backbone at each step. Why it matters: Smooth handoff from geometry to details prevents abrupt artifacts and matches how diffusion refines images. 🍞 Anchor: “Turn off the lamp” sets darkening and shadow geometry early, then refines metal reflections and fabric grain late.
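The blend in Step 4 is a plain linear interpolation over the diffusion timeline. The sketch below assumes guidance features are flat vectors and that t is normalized to [0, 1] with t = 1 at the noisiest (earliest) step; the function name is mine, not the paper's.

```python
def timestep_guidance(structure, texture, t):
    """Blend structure and texture guidance by diffusion timestep.

    Implements ran(t) = t * structure + (1 - t) * texture: at t = 1
    (pure noise) the guidance is all structure; at t = 0 (final steps)
    it is all texture, giving a smooth coarse-to-fine handoff.
    """
    return [t * s + (1.0 - t) * x for s, x in zip(structure, texture)]
```

Because the mix varies continuously with t, there is no hard switch between the two teachers at any single step.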
Step 5: Optimize with Two Losses, Cleanly Separated 🍞 Hook: Grade the blueprint and the paint job separately. 🥬 The Concept (Disentangled Training): What it is: A diffusion loss trains the editor; a transition loss trains the queries and their heads. How it works:
- Standard diffusion/flow-matching loss updates the diffusion transformer and feature extractors.
- A transition loss aligns the queries’ predictions to DINO/VAE targets, weighted by timestep.
- No cross-interference: queries learn dynamics; diffusion backbone learns to render. Why it matters: Prevents the dynamics memory from being muddled with pure rendering quality. 🍞 Anchor: The editor keeps its instruction-following skills, while the queries specialize in physical transitions.
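The "cleanly separated" update in Step 5 amounts to routing each loss's gradient to its own parameter group. Here is a toy SGD sketch of that routing; the parameter-group names, scalar parameters, and precomputed gradients are illustrative assumptions, not the paper's training code.

```python
def training_step(grads_diffusion, grads_transition, params, lr=0.1):
    """Route each loss's gradient to its own parameter group (toy SGD).

    Assumed grouping: the 'backbone' receives only the diffusion/flow-matching
    gradient, while 'queries' and 'heads' receive only the transition-alignment
    gradient, so neither objective interferes with the other's parameters.
    """
    updated = {}
    for name, value in params.items():
        if name == "backbone":
            grad = grads_diffusion.get(name, 0.0)
        else:  # "queries", "heads"
            grad = grads_transition.get(name, 0.0)
        updated[name] = value - lr * grad
    return updated
```

In a real framework the same separation is usually achieved with optimizer parameter groups or by detaching tensors so each loss only backpropagates into its own modules.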
Step 6: Keep it Fast at Inference 🍞 Hook: Study with videos, test with a single picture. 🥬 The Concept (Train with videos, infer with one image): What it is: Use video only during training; at test time, you feed source + instruction. How it works:
- The frozen MLLM writes physics hints.
- Transition queries predict structure/texture guidance from that context.
- Diffusion produces the final edited image. Why it matters: You get the realism of video training without paying a video-generation cost at test time. 🍞 Anchor: On “change foggy to sunny,” the model brightens correctly, clears haze in depth order, and sharpens shadows—without simulating a full video.
Secret Sauce:
- Dual thinking: text rules + visual priors.
- Transition queries: compact, generalizable dynamics memory.
- Timestep-aware mixing: gentle, physics-friendly guidance across the diffusion timeline.
Concrete Walkthrough (Insert Straw into Water):
- Input: Photo of a glass of water; instruction: “Insert a straw.”
- Reasoning: Notes refraction at the air–water boundary and magnification below water.
- Queries: Predict plausible mid-trajectory features: angle change at surface, slight shift of visible straw width underwater.
- Diffusion: Early steps set the bend at the surface (structure), later steps tune ripples, highlights, and glass distortions (texture).
- Output: A straw that visibly kinks at the surface, with correct underwater magnification and reflections on the glass.
04 Experiments & Results
The Tests and Why They Matter:
- PICABench: Like a physics report card for edited images. It checks optics (light propagation, source effects, reflection, refraction), mechanics (deformation, causality), and state transitions (global/local). This directly probes whether edits obey physical laws.
- KRISBench: A knowledge and reasoning exam. It splits tasks into factual, conceptual, and procedural knowledge, including temporal perception. It asks, “Does your edit make sense over time and under real science?”
The Competition:
- Proprietary heavyweights: GPT-Image-1/1.5, Nano Banana (and Pro), Seedream.
- Open-source leaders: Qwen-Image-Edit (our backbone), BAGEL (+Think), OmniGen2, FLUX.1 Kontext, Uni-CoT, ChronoEdit, etc.
Scoreboard with Context:
- On PICABench, PhysicEdit scores 64.86 overall. That’s like moving from a solid B (baseline Qwen-Image-Edit at 61.26) to a strong A−, especially in categories driven by hidden dynamics. Notable jumps:
- Light Source Effects: 61.19 → 76.16 (big win on how light sources affect scenes).
- Deformation: +12 points to 60.76 (better at squish, bend, and material response).
- Causality: 48.95 → 59.23 (changes look caused by the right things).
- Refraction and Local State Transition also improve, meaning optics and fine-grained changes are more lawful.
- On KRISBench, PhysicEdit reaches 72.16 overall, topping open-source baselines and beating some proprietary models (e.g., Gemini-2.0, Doubao). Highlights:
- Temporal Perception (Factual): 71.73 → 76.13 (better sense of how scenes evolve in time).
- Natural Science (Conceptual): +11.9 to 71.57 (text reasoning awakens real scientific knowledge during edits).
Surprising/Noteworthy Findings:
- Text vs. Vision Synergy: Adding only the text reasoning helps mechanics (logical constraints), but optics barely moves. Adding only the visual transition queries boosts optics strongly, but mechanics suffers. The full dual-thinking model gets the best of both—logic plus rendering.
- Structure vs. Texture Teachers: DINO-only variants lock structure well (great global transitions) but miss local deformations. VAE-only variants polish textures but can lose coherence. Blending them smoothly over diffusion timesteps beats hard switching.
- Implicit > Explicit Rollouts (for this task): ChronoEdit’s explicit intermediate-frame generation can accumulate errors and costs more compute. PhysicEdit’s latent queries sidestep that, improving both PICABench and KRISBench while remaining efficient.
- No Trade-off for Everyday Editing: On general editing benchmarks (ImgEdit-Bench, GEdit-Bench-EN), PhysicEdit keeps or slightly improves performance over its backbone. So learning physics didn’t erase the model’s everyday skills.
Concrete Examples (Qualitative):
- “Turn off the lamp”: PhysicEdit dims the room consistently, grows and softens shadows in the right directions, and reduces reflective hotspots. Some baselines just darken the whole image uniformly or muddle shadow geometry.
- “Freeze the soda can”: PhysicEdit adds frost with appropriate texture and reduces gloss; naive methods may overlay an icy tint without material realism.
- “Remove the ladder” or “Remove the white stand”: PhysicEdit preserves background structure without warping, thanks to better structural guidance.
Bottom Line:
- PhysicEdit closes the gap between following instructions and obeying physics. It sets a new open-source state of the art on physics-focused editing while staying competitive with top proprietary models, especially in categories that require understanding the in-betweens of change.
05 Discussion & Limitations
Limitations:
- Coverage of Physics: Although PhysicTran38K spans 46 transitions, real life is broader. Exotic materials, multiphysics scenarios, and rare edge cases may still trip the model.
- Video-Training Dependence: The approach relies on video-derived supervision quality. If the generated/filtered clips contain subtle artifacts, those biases can seep into the priors.
- Domain Shift: Drastically different camera optics, extreme lighting, or unusual object scales may challenge the learned priors.
- No Explicit Simulation: The method learns priors, not full-blown physics engines. For precise scientific measurements, you still need simulation.
Required Resources:
- Training uses video clips, two frozen encoders (DINOv2, VAE), a frozen MLLM, and a diffusion backbone. The reported setup (LoRA fine-tuning for roughly 12 hours) is reasonable but not tiny. Storage for the dataset and I/O throughput matter.
When Not to Use:
- Fantasy or Superpowers: If you want anti-physics effects (e.g., floating rocks, light pushing heavy objects), this model will try to “correct” them.
- Highly Specialized Lab Phenomena: If you need exact thermodynamics or micro-optics beyond common cues, a scientific simulator is better.
- Multi-View Consistency Across Moving Cameras: The method expects single-image inference and benefits from fixed-view training; large camera motion logic isn’t its focus.
Open Questions:
- Adaptive Query Banks: Could dynamic, scene-specific queries outperform a single global set while avoiding overfitting?
- Multiphysics Compositions: How to chain or blend multiple simultaneous transitions (e.g., heating + bending + reflection changes) robustly?
- Real-World Data: What happens when more real videos (not generated) with verified physics are incorporated at scale?
- 3D Awareness: Can adding depth or 3D cues strengthen refraction, shadow casting, and occlusion under more complex geometry?
- Safety & Attribution: How can we watermark or detect physics-aware edits to prevent misuse while preserving creative freedom?
06 Conclusion & Future Work
Three-Sentence Summary: This paper reframes image editing as predicting a physical state transition, not just jumping from “before” to “after.” It introduces PhysicTran38K to teach in-between dynamics and proposes PhysicEdit, which blends physics-aware text reasoning with latent transition queries learned from videos. The result is an editor that follows instructions and obeys real-world laws, achieving state-of-the-art open-source performance on physics-heavy benchmarks.
Main Achievement: Distilling video-based dynamics into compact transition queries—and coupling them with explicit physics reasoning—lets a single-image editor produce physically faithful results without running a video at inference.
Future Directions: Expand transition coverage and real-world video diversity; explore adaptive or scene-specific query banks; add 3D signals for better occlusion and light transport; and design robust provenance tools for responsible use. Further, integrate lightweight simulation hints where high precision is needed.
Why Remember This: It marks a shift from “semantic edits” to “cause-and-effect edits,” teaching models the story between frames. That leap—from statics to dynamics—brings AI image editing closer to how the real world actually works, making results more trustworthy for learning, design, and daily life.
Practical Applications
- AR try-ons that show correct fabric drape, creasing, and shadowing on moving bodies.
- Product photo edits that respect gloss, frost, and condensation changes under temperature or lighting shifts.
- Interior design previews with physically accurate light dimming, shadow growth, and reflection changes.
- Science education visuals that demonstrate real optical effects (refraction, reflection) and thermal processes.
- Industrial concept imagery that honors material deformation (bend, compression) for early design feedback.
- Movie and game art passes where quick edits keep realistic light paths and material responses.
- Safety training scenes with believable fracture patterns, cooling effects, and shadow logic.
- E-commerce background and lighting adjustments that remain physically consistent across a catalog.
- Photo restoration and object removal that maintain coherent structure and illumination.
- Advertising composites where inserted objects match scene physics (shadows, reflections, contact).