
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Intermediate
Yiqi Lin, Guoqiang Liang, Ziyun Zeng et al. · 3/2/2026
arXiv

Key Summary

  • Kiwi-Edit is a new video editor that follows your words and also copies looks from a picture you give it.
  • The team built a huge open dataset (RefVIE) of 477,000 training examples that include source video, instruction, reference image, and the edited target video.
  • They invented a pipeline that turns old text-only editing pairs into full training quadruplets by creating smart reference images.
  • Their model mixes a language-and-vision brain (an MLLM) with a video artist (a Diffusion Transformer) and connects them with special learnable tokens.
  • A hybrid trick keeps the original video’s structure while borrowing textures from the reference so edits look stable and accurate.
  • They train in three careful stages so the parts learn to talk to each other before doing hard reference-guided edits.
  • On public benchmarks, Kiwi-Edit beats strong open models on instruction-only tasks and matches or surpasses a leading commercial system on some reference-guided tasks.
  • They also created RefVIE-Bench, a human-checked test for identity matching, background swaps, and smooth motion.
  • This work makes precise, controllable video edits more accessible to everyone and releases data, models, and code openly.

Why This Research Matters

Precise video edits are useful everywhere: ads need exact brand looks, educators need consistent visuals, and creators want unique styles that stay stable across frames. Kiwi-Edit makes this precision possible by letting you say what you want and also show what it should look like. The open RefVIE dataset finally gives the community the missing training fuel to learn reference-following at scale. A unified architecture translates understanding into stable, realistic motion while matching exact textures and identities. With open models, code, and a fair benchmark, more teams can build reliable, controllable video editors. This lowers costs, speeds up production, and empowers creativity for individuals and studios alike.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how when you tell a friend, “Draw me a cool car,” you might get something close, but when you also show a photo of the exact car, the drawing suddenly looks much more like what you wanted?

🥬 Filling (The Actual Concept)

  • What it is: Instruction-based video editing lets you change a video just by typing a sentence, like “Make the sky pink” or “Remove the dog.”
  • How it works: 1) A model reads your text, 2) it finds the parts of the video to change, 3) it redraws those parts across frames while trying to keep the rest of the video the same.
  • Why it matters: Without this, you’d need tricky software skills to make edits; with it, anyone can tweak videos using plain language.

🍞 Bottom Bread (Anchor) Example: You type “Add snow to the scene,” and the tool sprinkles snow on all frames while keeping people and buildings unchanged.

🍞 Top Bread (Hook) Imagine explaining a hairstyle with words only: “short but not too short, with side-swept bangs, slightly wavy…” It’s hard! But showing a photo nails it.

🥬 Filling (The Actual Concept)

  • What it is: Reference-guided video editing uses both your instruction and a picture you provide to lock in the exact look (color, texture, identity) you want.
  • How it works: 1) You give the video and a sentence, 2) you add a reference image (like a specific hat or background), 3) the model copies the reference’s look into the right spot of the video, 4) it keeps the change steady across frames.
  • Why it matters: Language alone is fuzzy; references remove guessing, so results match your vision.

🍞 Bottom Bread (Anchor) Example: “Replace the boy’s cap with this classic black fedora.” The model uses the photo to match the fedora’s exact shape and material every frame.

🍞 Top Bread (Hook) Think of building a LEGO set: you need all pieces—missing ones make the build impossible.

🥬 Filling (The Actual Concept)

  • What it is: Training data quadruplets are complete sets: source video, instruction, reference image, and the final edited video.
  • How it works: 1) Collect a source/target pair, 2) read the instruction, 3) create or extract the right reference image, 4) bundle them together for training.
  • Why it matters: Models can’t learn to follow references if the training examples don’t include references.

🍞 Bottom Bread (Anchor) Example: To teach “replace the shirt with the one in this photo,” the training example must include that shirt’s image.

The World Before: Text-only editing became impressively good using diffusion-based video models. You could recolor skies, remove small objects, or apply global styles, often with decent motion stability. But when someone said, “Make this car look like my blue Mustang with white stripes,” models struggled because words can’t fully describe exact stripes, textures, or logos.

The Problem: Precise control—like matching a specific hat, sneaker, or wallpaper pattern—needs a visual example. Yet, almost no large, public datasets paired instructions, references, and target videos all together. Without those full examples, models couldn’t practice reference-following at scale.

Failed Attempts:

  • Pure text: Ambiguous. “Red dress” could be many shades and styles.
  • Small, private datasets: Some labs had their own data, but kept it closed, so others couldn’t build on it.
  • Image-only tricks: Borrowing from image editing helped a bit, but moving objects, occlusions, and camera motion made video harder.

The Gap: We needed lots of high-quality training quadruplets, but creating them by hand is slow and expensive. Also, we lacked a fair test focused on reference-following, not just text understanding.

🍞 Top Bread (Hook) Imagine finding missing puzzle pieces by cleverly cutting them out of pictures you already own.

🥬 Filling (The Actual Concept)

  • What it is: A scalable data generation pipeline that turns existing instruction-only video pairs into full quadruplets by synthesizing reference images.
  • How it works: 1) Gather many source–target video pairs with instructions, 2) find the edited region, 3) extract or inpaint a clean reference image, 4) filter for quality and remove duplicates.
  • Why it matters: This fills the training gap without costly manual work.

🍞 Bottom Bread (Anchor) Example: If the background changed to “winter forest,” the pipeline removes the person and produces a clean winter forest image as the reference.

Real Stakes: In daily life, creators want exact looks—brand outfits in ads, consistent props across shots, and movie scenes that swap locations while keeping actors intact. Teachers could tailor educational videos, small businesses could update product videos quickly, and social media users could personalize clips with precise styles. Better control saves time, reduces cost, and boosts creativity while keeping motion smooth and believable.

02Core Idea

🍞 Top Bread (Hook) You know how a student writes better essays when they both understand the question and have a great example to copy the style from?

🥬 Filling (The Actual Concept)

  • What it is: The key insight is to combine clear language instructions with a visual example and teach a single model to use both—powered by a new dataset that makes this training possible.
  • How it works: 1) Build RefVIE, a massive set of quadruplets; 2) use an MLLM to understand the instruction plus what the reference looks like; 3) feed that understanding to a Diffusion Transformer that edits the video; 4) use clever connectors and a hybrid injection method so the video keeps its structure while copying reference textures.
  • Why it matters: With text alone, models guess; with references and a unified architecture, edits become accurate, consistent, and controllable.

🍞 Bottom Bread (Anchor) Example: “Replace the background with this snowy park.” The MLLM parses the request and the snowy photo; the video generator swaps the background, keeps people intact, and matches lighting and perspective across frames.

Multiple Analogies (3 ways):

  1. Recipe + Photo: The instruction is the recipe, the reference is the photo of the dish, and the chef (model) cooks so the final plate tastes and looks right.
  2. Dress Tailor: Your words say “shorten sleeves, add lace,” but the reference dress ensures the exact lace pattern and fabric are matched.
  3. Map + Landmark: The instruction is the map (“go north two blocks”), the reference is the landmark photo so you don’t miss the exact building.

Before vs After:

  • Before: Text-only edits often drifted from the exact look users wanted, especially for identity (logos, patterns) or specific backgrounds.
  • After: With instruction-plus-reference and a model built to use both, the results lock onto exact appearances and keep them steady over time.

🍞 Top Bread (Hook) Imagine a librarian who can read your request, look at pictures you bring, and then guide an artist to redraw scenes accurately.

🥬 Filling (The Actual Concept)

  • What it is: A Multimodal Large Language Model (MLLM) acts as the understanding engine for text and images.
  • How it works: 1) It reads the instruction and sees the video frames and reference, 2) it packs the meaning into special tokens, 3) it passes those tokens to the video generator.
  • Why it matters: Without a strong multimodal brain, the model can’t correctly interpret what to change and what the reference looks like.

🍞 Bottom Bread (Anchor) Example: “Replace the boy’s shoes with these red sneakers.” The MLLM notices color, style, and where shoes appear in frames.

🍞 Top Bread (Hook) Think of a patient artist who paints over a video frame little by little until it looks perfect, then repeats for the next frame.

🥬 Filling (The Actual Concept)

  • What it is: A Diffusion Transformer (DiT) is the generator that actually redraws the video under guidance.
  • How it works: 1) Start from a noisy version of the video, 2) use attention to focus on instructions and reference tokens, 3) remove noise step by step, 4) produce a clean edited video.
  • Why it matters: This engine turns understanding into pixels that look right and move smoothly.

🍞 Bottom Bread (Anchor) Example: After reading “add a brown fedora,” the DiT steadily makes the hat appear clearly on the head in every frame.

🍞 Top Bread (Hook) Imagine asking a teacher questions with little sticky notes labeled “What to do” and “What it should look like.”

🥬 Filling (The Actual Concept)

  • What it is: Learnable query tokens are special notes; Query Connector and Latent Connector translate notes from the MLLM into signals the DiT understands.
  • How it works: 1) Query tokens summarize the instruction (what to change), 2) Reference latents carry details from the reference image (how it should look), 3) both feed the DiT’s attention.
  • Why it matters: Without these connectors, the generator can miss instructions or fail to copy textures exactly.

🍞 Bottom Bread (Anchor) Example: Query tokens tell “swap the coat,” while reference latents deliver the exact blue fabric and buttons to copy.
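The split between "what to do" and "how it should look" can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the use of plain linear projections, a single softmax attention step, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def connect(mllm_hidden, ref_hidden, n_query=8, dit_dim=16):
    """Toy sketch of the dual-connector idea (names/dims are illustrative).

    mllm_hidden: (seq, d) MLLM features over instruction + source frames
    ref_hidden:  (ref_seq, d) MLLM features over the reference image
    """
    d = mllm_hidden.shape[1]
    # Query Connector: learnable query tokens summarize "what to do"
    # by attending over the MLLM's features.
    queries = rng.standard_normal((n_query, d))
    scores = queries @ mllm_hidden.T
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    what_to_do = (attn @ mllm_hidden) @ rng.standard_normal((d, dit_dim))
    # Latent Connector: dense projection of reference tokens,
    # carrying "how it should look" at full detail.
    how_it_looks = ref_hidden @ rng.standard_normal((d, dit_dim))
    return what_to_do, how_it_looks

q, r = connect(np.ones((32, 4)), np.ones((6, 4)))
print(q.shape, r.shape)  # (8, 16) (6, 16)
```

Note how the query path compresses (32 tokens down to 8 summaries) while the reference path stays dense (all 6 tokens survive), matching the intuition that instructions can be summarized but textures cannot.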

🍞 Top Bread (Hook) You know how you might lightly trace over the original drawing so you keep the layout, then paste in a fabric sample to match texture?

🥬 Filling (The Actual Concept)

  • What it is: Hybrid latent injection preserves the video’s structure (element-wise add with a learnable time-based weight) and adds reference features by sequence concatenation for fine texture copying.
  • How it works: 1) Source video latents are added with a scale that changes over time to keep layout and motion, 2) reference latents are concatenated so the model can “look over” and copy textures.
  • Why it matters: Without this, edits either wobble the original layout or fail to match the exact look.

🍞 Bottom Bread (Anchor) Example: Keeping the person’s pose and path the same while copying the plaid pattern of a reference scarf perfectly.
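The two injection routes above can be sketched in NumPy. The exact time schedule and all names here are illustrative assumptions (the paper learns its time-based weight; this toy just uses a linear decay):

```python
import numpy as np

def hybrid_inject(noisy_latents, source_latents, ref_latents, t, alpha):
    """Toy sketch of hybrid latent injection (illustrative, not the paper's code).

    noisy_latents:  (seq, dim) video tokens being denoised
    source_latents: (seq, dim) tokens of the original video
    ref_latents:    (ref_seq, dim) tokens of the reference image
    t:              diffusion time in [0, 1]
    alpha:          learnable scalar for the time-based scale
    """
    # Structure route: element-wise add with a scale that depends on t,
    # so layout/motion guidance from the source video is preserved.
    scale = alpha * (1.0 - t)  # assumed schedule; the real one is learned
    fused = noisy_latents + scale * source_latents

    # Texture route: sequence concatenation, so the DiT's attention can
    # "look over" at the reference tokens and copy fine appearance.
    return np.concatenate([fused, ref_latents], axis=0)

video = np.zeros((16, 8)); src = np.ones((16, 8)); ref = np.ones((4, 8))
out = hybrid_inject(video, src, ref, t=0.5, alpha=0.2)
print(out.shape)  # (20, 8)
```

The key design point survives even in the toy: addition mixes the source into every video token (structure), while concatenation keeps the reference as separate tokens the model can attend to (texture).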

Why It Works (Intuition): The MLLM is great at understanding what and where to edit; the DiT is great at drawing. The dual connectors separate “what to do” from “what it should look like,” and the hybrid injection keeps the scene’s bones (structure) while replacing the skin (textures). The new dataset teaches the model countless examples, so it learns how to do this robustly.

Building Blocks:

  • RefVIE dataset of 477K quadruplets for diverse tasks.
  • MLLM for multimodal understanding with LoRA for light adaptation.
  • DiT generator trained with a flow-matching objective for stable learning.
  • Dual connectors and hybrid injection to balance structure and detail.
  • Three-stage curriculum so each part learns in the right order.

03Methodology

High-Level Recipe: Input (video + instruction + optional reference) → MLLM encodes meaning → Query & Latent Connectors produce context tokens → DiT edits video with hybrid injection → Output edited video.

Data Generation Pipeline (turning triplets into quadruplets): 🍞 Top Bread (Hook) Think of making a complete lunchbox from leftovers—adding the missing fruit and drink so it’s balanced.

🥬 Filling (The Actual Concept)

  • What it is: A four-stage automated pipeline that creates reference images and assembles training quadruplets at scale.
  • How it works:
    1. Source aggregation and filtering: Gather millions of instruction-based video pairs and keep only high-quality ones using a score (EditScore).
    2. Grounding and segmentation: Use an MLLM to find exactly where the edit happened; refine the region with a smart segmenter (SAM3).
    3. Reference synthesis: Extract the edited object or inpaint a clean background to form a crisp reference image.
    4. Quality control: Check semantic match with an MLLM, remove duplicates by CLIP features, and drop odd sizes.
  • Why it matters: Without reliable references and strict filtering, training would be noisy, and models would learn wrong habits.

🍞 Bottom Bread (Anchor) Example: If the instruction was “replace the cat with a white dog,” the pipeline crops the white dog from the target and makes a clean reference image of that dog.

Stage-by-Stage with Examples:

  1. Aggregation & Filtering: Start with ~3.7M samples from open datasets. Use EditScore to keep strong edits (≥6 for general, >8 for reference-focused tasks). Skip weak or confusing samples. Example: Remove a shaky edit where the “red dress” barely changed.
  2. Grounding & Segmentation: Qwen3-VL grounds a box on the edited area in the target’s first frame; SAM3 refines it to pixel-accurate masks. Example: For background swaps, mark and remove the person to isolate the new background.
  3. Reference Synthesis: Use a strong image editor to either crop the edited object onto a clean background (local change) or inpaint a clean background (background change). Example: Cut out the exact blue sneakers used in the target and save as reference.
  4. Quality Control: Ask an MLLM, “Does this reference match the edited result?” Keep high scorers; deduplicate by CLIP embeddings so near-duplicates don’t flood training.
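Stages 1 and 4 above boil down to threshold filtering plus embedding-based deduplication. A minimal sketch, assuming unit-norm CLIP features and using the EditScore threshold from the text (field names like `edit_score` and `clip_feat` are illustrative):

```python
import numpy as np

def filter_and_dedup(samples, score_thresh=6.0, sim_thresh=0.95):
    """Toy sketch of pipeline stages 1 (filtering) and 4 (dedup).

    samples: list of dicts with "edit_score" (float) and
             "clip_feat" (unit-norm np.ndarray embedding).
    sim_thresh is an assumed duplicate cutoff, not the paper's value.
    """
    kept, feats = [], []
    for s in samples:
        # Stage 1: drop weak edits below the EditScore threshold.
        if s["edit_score"] < score_thresh:
            continue
        # Stage 4: drop near-duplicates by cosine similarity of CLIP features.
        f = s["clip_feat"]
        if any(float(f @ g) > sim_thresh for g in feats):
            continue
        kept.append(s)
        feats.append(f)
    return kept

a = np.array([1.0, 0.0]); b = np.array([0.0, 1.0])
data = [
    {"edit_score": 7.0, "clip_feat": a},
    {"edit_score": 5.0, "clip_feat": b},  # dropped: weak edit
    {"edit_score": 8.0, "clip_feat": a},  # dropped: duplicate of the first
    {"edit_score": 9.0, "clip_feat": b},
]
print(len(filter_and_dedup(data)))  # 2
```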

Model Architecture: 🍞 Top Bread (Hook) Imagine a two-person team: a planner who understands your request and an artist who paints, passing sticky-note instructions and fabric swatches between them.

🥬 Filling (The Actual Concept)

  • What it is: A frozen MLLM (lightly adapted with LoRA) plus a DiT video generator, bridged by Query and Latent Connectors.
  • How it works:
    • MLLM: Reads interleaved source frames, instruction text, and optional reference images; outputs rich features.
    • Query Connector: Projects learnable query tokens that summarize the instruction (“what to do”).
    • Latent Connector: Projects dense reference tokens (“how it should look”).
    • DiT: Receives these as cross-attention context while it denoises video latents.
    • Hybrid Injection: Element-wise add source video features with a learnable time scale to preserve layout; concatenate reference features to let attention copy textures.
  • Why it matters: Each piece has a clear job; together, they keep motion stable and appearances accurate.

🍞 Bottom Bread (Anchor) Example: “Replace the wood table with this round white coffee table.” The planner (MLLM) understands the request and the table’s look; the artist (DiT) swaps the table while keeping hands, shadows, and camera moves consistent.

Training Objective: 🍞 Top Bread (Hook) Picture steering a boat from noisy waves to a calm harbor by always pointing straight toward the goal.

🥬 Filling (The Actual Concept)

  • What it is: Flow matching trains the model to predict a direct ‘velocity’ from noise toward the target video latent.
  • How it works: 1) Mix target with noise at a random time, 2) predict the velocity back to target, 3) minimize the squared error.
  • Why it matters: It stabilizes training and helps the model learn consistent paths from noisy starts to clean edits.

🍞 Bottom Bread (Anchor) Example: From a fuzzy, grainy frame, the model learns to move straight toward a clear frame with the new hat.
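The three steps above can be written out directly. This is one common rectified-flow parameterization, shown for intuition; interpolation and velocity conventions vary between papers, so this is not necessarily Kiwi-Edit's exact formulation:

```python
import numpy as np

def flow_matching_step(model, x_target, noise, t):
    """One flow-matching training step (toy sketch).

    model(x_t, t) predicts a velocity; x_target is the clean target latent.
    """
    # 1) Mix target with noise at time t in [0, 1].
    x_t = (1.0 - t) * x_target + t * noise
    # 2) The ground-truth "velocity" points from noise toward the target.
    v_true = x_target - noise
    # 3) Minimize the squared error of the predicted velocity.
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_true) ** 2)

# An oracle "model" that already knows the answer gets zero loss;
# training pushes a real DiT toward this.
x = np.ones((2, 3)); eps = np.zeros((2, 3))
oracle = lambda x_t, t: x - eps
print(flow_matching_step(oracle, x, eps, t=0.5))  # 0.0
```

Because the target velocity is the same straight line at every t, the model learns a consistent direction from any noise level, which is what makes training stable.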

Training Curriculum (3 stages): 🍞 Top Bread (Hook) Like learning piano: first learn notes, then play simple songs, then perform with both hands and pedal.

🥬 Filling (The Actual Concept)

  • What it is: A progressive plan so parts align before hard tasks.
  • How it works:
    1. Alignment: Freeze big backbones, train only LoRA, connectors, and query tokens on text-only image/video edits so the MLLM’s language matches the DiT’s attention needs.
    2. Instructional tuning: Unfreeze the DiT; co-train on large, quality image and video edit sets; start at lower resolution and scale up.
    3. Reference fine-tuning: Mix in RefVIE quadruplets so the model learns to use reference tokens for identity- and texture-true edits.
  • Why it matters: Skipping steps leads to confusion—like trying to perform a duet before learning the melody.

🍞 Bottom Bread (Anchor) Example: After stages 1–2 learn ‘remove’ and ‘recolor’ well, stage 3 adds skill in copying a specific logo from a reference shirt.
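The freeze/unfreeze schedule in the three stages can be summarized as a small table in code. Module names here are illustrative labels for the components described above, not identifiers from the released code:

```python
def trainable_params(stage):
    """Which components train at each stage (toy sketch of the curriculum)."""
    modules = {
        "mllm_backbone": False,   # frozen throughout; LoRA adapts it lightly
        "lora": True,             # trained from stage 1 (alignment)
        "connectors": True,       # Query + Latent Connectors
        "query_tokens": True,
        "dit": stage >= 2,        # unfrozen at stage 2 (instructional tuning)
    }
    return {name for name, on in modules.items() if on}

data_mix = {
    1: "text-only image/video edits (alignment)",
    2: "large instruction-edit sets, low-to-high resolution",
    3: "mix in RefVIE quadruplets (reference fine-tuning)",
}
print("dit" in trainable_params(1), "dit" in trainable_params(3))  # False True
```

Stage 3 changes the data mix rather than the set of trainable parts: the same model now sees quadruplets, so the already-aligned reference pathway finally gets exercised.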

Secret Sauce (Clever Bits):

  • Dual-connector design separates “what to do” (queries) from “how it should look” (reference latents).
  • Hybrid latent injection balances structural preservation (add with time scaling) with texture fidelity (sequence concat).
  • Scalable data pipeline produces high-fidelity references at scale, which was the missing fuel for training.

04Experiments & Results

The Test: The team measured how well edits followed instructions, matched reference identities or backgrounds, and stayed smooth over time. They used both a popular community benchmark (OpenVE-Bench) and their new human-checked RefVIE-Bench.

🍞 Top Bread (Hook) Think of a talent show where judges score: Did the act follow the rules, look like the example, and keep rhythm the whole time?

🥬 Filling (The Actual Concept)

  • What it is: Evaluation along three axes—following the instruction, matching the reference (if any), and temporal consistency.
  • How it works: 1) Automated judging by a strong MLLM with strict rubrics, 2) scores from 1 to 5 on each dimension, 3) overall comparisons against leading systems.
  • Why it matters: A great single frame isn’t enough; the whole video must stay correct and stable.

🍞 Bottom Bread (Anchor) Example: For “replace with this fedora,” judges check that the hat really matches the photo, sticks to the head across frames, and fits lighting.

Competition: The model was compared to open-source editors like VACE, OmniVideo, InsViE, ICVE, Lucy-Edit, DITTO, OpenVE-Edit, and also to strong commercial systems like Runway Aleph and Kling-O1.

Scoreboard with Context:

  • Instruction-only (OpenVE-Bench): Kiwi-Edit achieved an overall 3.02, beating the prior open best (2.50). That’s like moving from a solid B to an A− when others are around B.
  • Background change: 3.84, even surpassing a popular commercial tool’s 2.62 in that category—like getting top marks in the toughest event.
  • Resolution and training stages: Higher inference resolution (1280×704) and the staged training boosted results consistently.
  • Reference-guided (RefVIE-Bench): Kiwi-Edit trained with RefVIE scored 3.31 overall, slightly ahead of Runway Aleph (3.29). While Kling-O1 scored higher (3.99), Kiwi-Edit sets a strong open-source baseline for reference-following.

🍞 Top Bread (Hook) Imagine a fair test where you must copy a specific sticker (object) or wallpaper (background) into a moving scene and keep it looking natural.

🥬 Filling (The Actual Concept)

  • What it is: RefVIE-Bench is a curated set with subject references and background replacements, scored for identity, matting, and harmony.
  • How it works: 1) 110 human-verified cases, 2) identity or background fidelity is primary, 3) temporal and physical scores are capped by identity fidelity to avoid rewarding wrong-but-stable edits.
  • Why it matters: It tests what really counts in reference editing—matching the given look while staying stable and realistic.

🍞 Bottom Bread (Anchor) Example: Replacing bread with a clay hamburger: judges check the hamburger looks exactly like the reference clay model, moves correctly, and matches lighting.
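The capping rule is the interesting part of the scoring: smoothness cannot rescue a wrong edit. A minimal sketch, where the simple-mean aggregation is an assumption (the text specifies only the cap, not how the dimensions combine):

```python
def refvie_bench_score(identity, temporal, physical):
    """Toy sketch of RefVIE-Bench scoring on a 1-5 scale.

    Temporal and physical scores are capped by identity fidelity, so a
    stable-but-wrong edit is not rewarded. The mean aggregation below
    is illustrative, not the benchmark's exact formula.
    """
    temporal = min(temporal, identity)
    physical = min(physical, identity)
    return (identity + temporal + physical) / 3.0

# A wrong identity with perfectly smooth, realistic motion still scores low:
print(refvie_bench_score(identity=2.0, temporal=5.0, physical=5.0))  # 2.0
```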

Surprising Findings:

  • Stage 3 (reference fine-tuning) improved local object edits most, but slightly reduced background swap scores—likely reflecting dataset bias toward local edits. This suggests balancing data types could improve both.
  • Adding source features with a learnable time-based scale worked best; channel concatenation underperformed, proving structure-preserving injection is key.
  • Training image data alongside video helped spatial precision (like neat edges), showing images still teach valuable fine details for videos.

Overall: Kiwi-Edit reliably follows instructions, keeps identities consistent with references, and maintains temporal stability—especially strong on background changes and competitive on subject identity with open weights.

05Discussion & Limitations

Limitations:

  • Reference diversity: Although 477K examples are large, long-tail looks (rare textures or niche items) may remain underrepresented, reducing performance on unusual requests.
  • Very long videos: The model samples up to 81 frames during training; extremely long or fast-moving clips may still challenge temporal stability.
  • Lighting and physics: Perfectly matching complex lighting, subtle shadows, or reflections in dynamic scenes is still hard.
  • Multiple references: Handling many references at once (e.g., shoes, hat, jacket) may require more specialized conditioning or attention scheduling.

Required Resources:

  • GPU memory and time: Training or fine-tuning a multi-billion-parameter video model requires multi-GPU setups and careful batching.
  • High-quality references: Good, clear reference images lead to better matches; low-res or cluttered references reduce fidelity.
  • Clean prompts: Clear, specific instructions help the MLLM ground edits correctly.

When NOT to Use:

  • When you need exact frame-accurate compositing with guaranteed pixel-perfect edges for broadcast VFX; traditional pipelines may be safer.
  • Where legal or ethical constraints forbid swapping identities, logos, or branded items.
  • For highly technical scientific footage where any hallucination is unacceptable.

Open Questions:

  • Can we further automate high-quality reference synthesis for complex, multi-object scenes without human checks?
  • How to best balance data so background swaps and local object edits both excel after fine-tuning?
  • Can we push temporal consistency on very long clips without huge compute costs?
  • What are the best ways to incorporate multiple references, 3D cues, or depth to improve lighting and perspective matching?
  • Can lightweight versions run on consumer devices while keeping strong fidelity?

06Conclusion & Future Work

Three-Sentence Summary: Kiwi-Edit shows that combining instructions with visual references—and training on a massive, newly created dataset of quadruplets—makes video edits more accurate and stable. A smart architecture bridges an MLLM (for understanding) and a Diffusion Transformer (for drawing) with special connectors and a hybrid injection trick that preserves structure while copying textures. The model sets new open-source baselines, and the team releases both data and evaluation tools so the community can build even better editors.

Main Achievement: Turning instruction-only pairs into high-quality instruction–reference–video quadruplets at scale (RefVIE), and using them to train a unified model that strongly follows both text and reference while keeping motion consistent.

Future Directions:

  • Broaden RefVIE to include more multi-object, multi-reference, and longer video scenarios.
  • Improve lighting, shadow, and reflection matching using depth or 3D-aware features.
  • Develop efficient, smaller models for real-time or mobile editing without sacrificing fidelity.
  • Explore better balance in training to jointly maximize local edits and background swaps.

Why Remember This: It delivers the missing puzzle piece—reliable reference guidance at scale—so AI video editing can match exact looks, not just rough ideas. With open data, code, and a fair benchmark, Kiwi-Edit moves precise, accessible video editing from a cool demo to a robust tool anyone can use.

Practical Applications

  • Replace product packaging in ads to match the exact new design using a reference photo.
  • Swap a stage background in a music video while preserving performers’ motion and lighting.
  • Local wardrobe changes in films (e.g., exact jacket pattern) without reshooting.
  • Consistent brand overlays (logos, mascots) tracked across moving scenes.
  • Educational content updates (change lab equipment style) while keeping demonstrations intact.
  • Social media stylization: apply a specific visual theme or texture bundle across clips.
  • E-commerce try-ons: swap shoes or bags to exact catalog references in runway videos.
  • Newsroom corrections: blur or replace sensitive items with reference-compliant visuals.
  • Game trailers: replace in-game props with final art assets while keeping camera paths.
  • A/B testing creatives: quickly generate multiple background settings for the same scene.
#reference-guided video editing #instruction-based editing #multimodal large language model #diffusion transformer #flow matching #learnable query tokens #latent injection #RefVIE dataset #RefVIE-Bench #grounding and segmentation #LoRA adaptation #CLIP deduplication #background replacement #temporal consistency #video diffusion models