
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Intermediate
Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero et al. Ā· 2/19/2026
arXiv

Key Summary

  • VidEoMT shows that a single, well‑trained Vision Transformer (ViT) can segment and track objects in videos without extra tracking gadgets.
  • It does this by passing object ā€œqueryā€ tokens from one frame to the next (query propagation) and mixing them with always‑ready learned queries (query fusion).
  • The model is encoder‑only, so everything happens inside one ViT, which makes it very fast and simple.
  • On popular video benchmarks, VidEoMT runs 5×–10Ɨ faster than complex systems, reaching up to 160 frames per second, with similar accuracy.
  • Big, strongly pre‑trained ViTs (like DINOv2) are the key; they already learn stable features that stay consistent across views and time.
  • Removing heavy parts (context modules, re‑ID heads, full trackers) barely hurts accuracy when the ViT is strong, but it greatly boosts speed.
  • A tiny linear layer plus query fusion is enough to spot new objects while keeping old ones consistent across frames.
  • VidEoMT beats or matches state‑of‑the‑art systems on multiple tasks (VIS, VPS, VSS) while using a much simpler design.
  • Larger ViTs and stronger pre‑training shrink the gap to the most accurate but heavier models.
  • This simplification can make real‑time video understanding practical for many everyday applications.

Why This Research Matters

Many real-world tools—like AR glasses, smart cameras, home robots, and driver-assist systems—need fast, accurate video understanding. By proving that a single, well-pre‑trained ViT can handle both segmentation and tracking, VidEoMT cuts complexity while delivering real-time speed. That simplicity lowers engineering effort, makes maintenance easier, and reduces latency on edge devices. With fewer custom parts, deployment becomes more robust and scalable. Ultimately, this approach helps turn advanced video AI from lab prototypes into everyday, reliable products.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re watching a school play on video. You want to point to each actor, say who they are, color their costume, and keep track of them even when they walk behind the curtain and come back. That’s what video segmentation models try to do.

🄬 Filling (The Actual Concept): What it is: Video segmentation means finding the shapes (masks) of things in each frame, naming them (classes), and keeping their identities the same across time (tracking). How it works: 1) Look at a frame, 2) find objects and draw masks, 3) give each a name, 4) match the same objects in the next frame. Why it matters: Without it, the computer would confuse who is who every moment, like forgetting which actor is which every time the scene changes.

šŸž Bottom Bread (Anchor): Think of a video of two dogs playing. Good video segmentation keeps the brown dog labeled "Dog A" and the white dog "Dog B" the whole time, even if they run around or cross paths.

The World Before: For years, the best video segmentation systems were like giant Lego builds with many special pieces. They had one big piece to do per‑frame segmentation, and then extra pieces for tracking across time. These extra pieces—context‑aware modules, re‑identification heads, and deep trackers—made models accurate but slow and complicated. They also made it harder to train and harder to run on real devices.

šŸž Top Bread (Hook): You know how some students can do lots of subjects well because they have a strong foundation from practicing a lot? Some big Vision Transformers (ViTs) are like that for images.

🄬 Filling (The Actual Concept): What it is: A Vision Transformer (ViT) is a model that reads an image as a grid of patches and processes them like a sentence of tokens. How it works: 1) Split the image into patches (like tiny tiles), 2) turn each patch into a token, 3) use attention to let tokens share info, 4) produce features that describe what’s in the image. Why it matters: Strong ViTs, pre‑trained on huge datasets, can learn powerful and stable features that make many extra parts unnecessary.

šŸž Bottom Bread (Anchor): A ViT can look at a classroom photo and understand desks, backpacks, and students just by reasoning over patch tokens, much like reading a paragraph.

The Problem: Video segmentation needs two superpowers—segmenting each frame and tracking over time. Most methods split this into two teams: a segmenter and a tracker. The tracker adds modules to match objects across frames. But these parts slow things down and complicate the system.

Failed Attempts: People kept adding more modules: context features to handle edges and occlusions, re‑ID heads to push apart different objects and pull together the same ones, and multi‑layer trackers with cross‑attention. These worked but were heavy and slow, like using three remote controls to operate one TV.

šŸž Top Bread (Hook): Imagine if your main backpack was so well organized, you didn’t need lots of extra pouches.

🄬 Filling (The Actual Concept): What it is: EoMT (Encoder‑only Mask Transformer) showed that for images, you can do segmentation by placing a few learnable queries directly into a big pre‑trained ViT, no decoder or extras. How it works: 1) Add learnable queries, 2) run them with the image tokens through the ViT, 3) read out masks and classes from those queries. Why it matters: If a strong ViT can handle image segmentation alone, maybe it can also handle video segmentation with minimal additions.

šŸž Bottom Bread (Anchor): Instead of building a complex tower of blocks, EoMT is like a sturdy single block doing most of the job.

The Gap: But video adds time. Even if a ViT can segment each frame, how do we keep the same object identity across frames without a tracker? If we drop the tracker completely, performance drops because the model can forget who’s who.

Real Stakes: Speed matters in the real world. Think about robots, AR glasses, or cars that need to understand scenes right away. A 10Ɨ speedup with similar accuracy is the difference between a safe, smooth experience and a laggy, unreliable one. That’s why the paper asks: can we keep the accuracy but remove the complexity by leaning on a powerful ViT backbone and a tiny bit of temporal glue?

02 Core Idea

šŸž Top Bread (Hook): You know how in class, when you move from one slide to the next, you still remember which bullet point you were discussing? That memory helps you stay on track.

🄬 Filling (The Actual Concept): What it is: The key insight is that a single, well‑pre‑trained ViT can do both segmentation and tracking if we simply pass its object queries forward in time (query propagation) and blend them with always‑present learned queries (query fusion). How it works: 1) Segment frame 0 with learned queries; 2) in frame 1, reuse the previous frame’s object queries (so the model remembers), 3) fuse them with learned queries so new objects can still be found, 4) repeat for each frame. Why it matters: Without passing queries forward, the model forgets identities; without fusing learned queries, it struggles to find brand‑new objects.

šŸž Bottom Bread (Anchor): It’s like keeping your place in a book using a bookmark (propagation) while still being ready to add new sticky notes for fresh ideas (fusion).

Multiple Analogies:

  • Theater analogy: Each actor holds a name card (query) that they carry from scene to scene. Passing the card forward keeps their identity. A spare set of blank cards (learned queries) is always available for new actors who join mid‑play.
  • Classroom analogy: Yesterday’s seating chart (propagated queries) helps you remember who sits where today, but you also keep empty seats (learned queries) for new students.
  • Sports analogy: A coach tracks players from quarter to quarter using jersey numbers (propagated), but also watches for substitutes (learned queries) entering the game.

Before vs After:

  • Before: Big segmenter + big tracker + extra re‑ID and context gadgets. Accurate but complex and slow.
  • After: One ViT encoder handles everything. A slim linear layer and a simple add operation fuse old and new queries. You keep accuracy and gain huge speed.

šŸž Top Bread (Hook): You know how a habit from lots of practice makes tasks easier?

🄬 Filling (The Actual Concept): What it is: DINO‑style pre‑training gives the ViT features that stay consistent across different views, angles, and even frames. How it works: 1) Show the model different views of the same thing, 2) teach it to make those views have similar features, 3) it learns stable, object‑level patterns. Why it matters: Stable features make it much easier to track the same object over time without extra tracking modules.

šŸž Bottom Bread (Anchor): If you see your friend wearing a hat in different photos, you still recognize them because your brain keeps a stable picture of their face.

Building Blocks (Sandwich style):

  • šŸž Hook: Imagine labeling toys in a toy box so you can find them next week. 🄬 Concept (Query Propagation): Reuse last frame’s object queries in the current frame so identities persist. Why it matters: Without it, you’d relabel toys from scratch every time. šŸž Anchor: Keep a list of which toy is in which cubby from yesterday to today.
  • šŸž Hook: When cooking, you keep salt on the counter even if most spices come from the pantry. 🄬 Concept (Query Fusion): Add learnable queries every frame so new objects are still discoverable. Why it matters: Without it, the model might miss new things entering the scene. šŸž Anchor: You spot a new ingredient arriving in the kitchen because you always keep an eye out for it.

Why It Works (intuition):

  • Strong ViT features already encode what stays the same across views; object queries act like handles to pull out complete masks and classes. Passing these handles forward keeps identity. Adding learned queries prevents tunnel vision so the system can still notice newcomers. This balanced memory‑plus‑curiosity is enough to do tracking—no big tracker needed.

03 Methodology

Overview (high level): Input video frames → ViT encoder with queries → Query propagation + query fusion across frames → Output masks and classes (with consistent identities)

Step‑by‑Step (with Sandwich explanations for new pieces):

  1. Frame tokenization and features in a ViT
  • šŸž Hook: Like cutting a big poster into small square tiles to study it piece by piece.
  • 🄬 Concept: The ViT splits the frame into patches, turns them into tokens, and uses attention to learn powerful features. Why it matters: These features are the playground where queries learn which pixels belong to which object. Without good features, masks get messy. Steps: a) Patch embedding, b) add positional info, c) run transformer blocks.
  • šŸž Anchor: A 720p image becomes many tokens that the ViT mixes and matches to understand the scene.
  2. Learnable queries for segmentation
  • šŸž Hook: Imagine you have 200 empty name tags ready to assign to objects you find.
  • 🄬 Concept: Learnable queries are vectors that the model uses to propose objects. They interact with the ViT features and produce a class label and a mask per query. Why it matters: Without queries, the model would lack consistent slots to place found objects. Steps: a) Inject queries into last ViT layers, b) update them via attention with image tokens, c) decode masks and classes from them.
  • šŸž Anchor: Each query becomes, say, ā€œperson with blue shirtā€ plus a mask covering their pixels.
  3. Training objective (classification + mask losses)
  • šŸž Hook: Think of a scoring rubric: one score for correct name, one for coloring inside the lines.
  • 🄬 Concept: The model uses cross‑entropy for class names and a mix of binary cross‑entropy plus Dice loss for masks. Why it matters: Without both parts, the model could name objects well but draw sloppy masks, or draw neat masks with wrong names. Steps: a) Match ground‑truth objects to queries, b) compute losses, c) update weights.
  • šŸž Anchor: If a query says ā€œcatā€ but colors the couch, it gets penalized and learns to adjust next time.
  4. Temporal matching by simple supervision
  • šŸž Hook: When you assign a locker to a student on day one, they keep that locker number all year.
  • 🄬 Concept: Ground‑truth objects stay matched to the same query across frames after their first appearance. Why it matters: This teaches the model a stable query order, so we can add past queries to current frames safely. Steps: a) First appearance → match to a query, b) keep that match later, c) train the model to respect that ordering.
  • šŸž Anchor: The same player wears jersey #7 in every game, making it easy to recognize them.
  5. Query Propagation (the memory)
  • šŸž Hook: Put a sticky note from yesterday into today’s notebook so you don’t forget where you left off.
  • 🄬 Concept: At t=0, use learned queries to segment. For t>0, feed the previous frame’s output queries into the last ViT layers. Why it matters: Without propagation, identities drift or swap. Steps: a) Output queries at tāˆ’1 become input ā€œtrack queriesā€ at t, b) ViT updates them with current frame tokens, c) produce masks/classes.
  • šŸž Anchor: The same query that found ā€œthe brown dogā€ yesterday helps find the brown dog today.
  6. Query Fusion (the curiosity)
  • šŸž Hook: Even if you have notes from yesterday, you still leave room for new ideas today.
  • 🄬 Concept: Fuse propagated queries with learned queries using a tiny linear layer plus element‑wise addition. Why it matters: Pure propagation can miss new objects. Fusion keeps the door open for new arrivals. Steps: a) Linear(Q_prev) + Q_learned → fused queries, b) run through ViT, c) decode masks/classes.
  • šŸž Anchor: A new cyclist entering the frame gets picked up because learned queries are always present.
  7. Inference loop
  • šŸž Hook: Like reading a comic strip panel by panel while remembering the storyline.
  • 🄬 Concept: For each frame, reuse previous queries (after fusion), run the ViT, output masks/classes, and pass the new queries forward. Why it matters: This simple loop replaces heavy trackers while keeping identities stable and spotting new objects. Steps: a) t=0 use learned queries, b) t>0 fuse propagated + learned, c) decode, d) repeat.
  • šŸž Anchor: A video of a parade is processed at up to 160 FPS with consistent identities and discovery of newcomers.

What breaks without each step?

  • Without good ViT features: masks become noisy or confused.
  • Without learnable queries: no stable slots to place objects.
  • Without temporal supervision: propagated queries may not align across time.
  • Without propagation: big identity drops (who is who?).
  • Without fusion: the model misses new objects that enter later.
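The mask losses named in the training objective (step 3) can be sketched on toy per-pixel probabilities; the tiny 4-pixel "mask" below is invented for the example:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over per-pixel mask probabilities."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def dice_loss(pred, target, eps=1e-7):
    """Dice loss: penalizes poor overlap between predicted and true masks."""
    inter = 2 * np.sum(pred * target)
    return 1 - (inter + eps) / (np.sum(pred) + np.sum(target) + eps)

pred = np.array([0.9, 0.8, 0.2, 0.1])    # predicted mask probabilities
target = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth mask
total = bce_loss(pred, target) + dice_loss(pred, target)
print(round(total, 3))  # ~0.314
```

BCE grades each pixel independently, while Dice grades the overall overlap, which is why the two are combined.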

The Secret Sauce:

  • The encoder‑only design allows the model to fully leverage hardware/software optimizations for transformers, avoiding slow custom modules.
  • DINO‑style pre‑training gives cross‑view consistency that naturally helps tracking.
  • Query fusion is a minimal, clever tweak—just a linear layer and addition—that balances memory (past) and discovery (present) with almost no overhead.

04 Experiments & Results

The Test: Researchers evaluated three things: accuracy (how correct the masks and labels are), consistency over time (do identities stay stable?), and speed (frames per second, FPS). They used standard scores: AP and AR for Video Instance Segmentation (VIS), VPQ and STQ for Video Panoptic Segmentation (VPS), and mIoU plus mVC for Video Semantic Segmentation (VSS). Speed was measured on strong GPUs with modern transformer optimizations.
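As a concrete example of one of these scores, here is mIoU (mean Intersection-over-Union) computed on a toy two-class segmentation map. This is a simplified 1-D sketch, not the benchmark implementation:

```python
import numpy as np

def miou(pred, target, num_classes):
    """Mean IoU: average per-class overlap between prediction and ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 1, 0])    # predicted class per pixel
target = np.array([0, 0, 1, 1, 0, 0])  # ground-truth class per pixel
score = miou(pred, target, num_classes=2)
print(round(score, 4))  # 0.7083
```

One wrongly labeled pixel lowers the IoU of both affected classes, which is why mIoU rewards masks that "color inside the lines".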

šŸž Hook: Think of a race where everyone runs laps (speed) while also carrying eggs on spoons without dropping them (accuracy and consistency).

🄬 Concept: The comparison was with top competitors like CAVIS, DVIS, DVIS‑DAQ, and DVIS++. How it works: Run all models on the same datasets and measure AP/AR (VIS), VPQ/STQ (VPS), mIoU/mVC (VSS), plus GFLOPs and FPS. Why it matters: Head‑to‑head tests show if the simpler method really keeps up.

šŸž Anchor: On YouTube‑VIS 2019, VidEoMT reached about 68.6 AP while running around 160 FPS—like getting an A while finishing the test 10Ɨ faster than others.

The Competition and Scoreboard (with context):

  • VIS (YouTube‑VIS 2019/2021/2022, OVIS): VidEoMT matched or came close to the best AP from heavy models, while being 5×–10Ɨ (even 14Ɨ vs some) faster. For example, against CAVIS on YouTube‑VIS 2019, accuracy stayed comparable (68.6 vs 68.9 AP), but speed jumped from ~15 FPS to ~160 FPS—like going from city bicycle to fast e‑bike without spilling water.
  • VPS (VIPSeg): VidEoMT gave a small VPQ drop compared to the absolute best (DVIS‑DAQ) but ran ~19Ɨ faster. That’s like arriving a minute later but using a simple scooter instead of a complex bus system.
  • VSS (VSPW): VidEoMT actually improved mIoU over strong baselines while being over 5Ɨ faster, and it had better video consistency too (+0.8 mVC). The simpler model wasn’t just fast; it also colored inside the lines more neatly.

Surprising Findings:

  • Even with no tracker at all (just per‑frame EoMT), the model kept a somewhat consistent query order, hinting that the ViT learned a natural ordering by itself.
  • Query propagation lifted accuracy without adding cost, but struggled with brand‑new objects; adding query fusion fixed that.
  • The real‑world speedup was larger than what FLOPs alone suggested. Why? Because a plain ViT can fully benefit from highly optimized transformer kernels, while custom modules become bottlenecks.
  • Bigger ViTs and stronger pre‑training shrank any accuracy gaps. With DINOv2 or DINOv3, VidEoMT stood toe‑to‑toe with the most accurate systems while staying much faster.

šŸž Hook: Like a student who learned great study habits (pre‑training) and doesn’t need many tutoring sessions (modules) to ace the test.

🄬 Concept: The role of pre‑training and size. How it works: With DINO‑style pre‑training, features are already stable across views, so tracking needs only light glue. Larger ViTs hold more capacity to represent details and temporal cues. Why it matters: If you downsize the model or weaken pre‑training, the benefits shrink and accuracy gaps widen.

šŸž Anchor: A small backpack can’t fit as many books; a lightly trained student needs more tools. Give them a big backpack and plenty of practice, and they can carry the course on their own.

Bottom line: Across multiple datasets and tasks, VidEoMT kept accuracy competitive and unlocked huge speed gains by avoiding specialized tracking modules and running everything inside one well‑trained ViT.

05 Discussion & Limitations

Limitations (be specific):

  • Needs strong pre‑training: Without DINO‑level pre‑training or with very small ViTs, accuracy drops more compared to heavy trackers.
  • New object detection without fusion: Pure propagation misses late‑arriving objects; fusion is essential.
  • Very long occlusions or sudden scene cuts: A single‑step propagation can lose track if an object disappears for many frames; there is no big memory bank.
  • Crowded scenes with many tiny instances: A fixed number of queries may be a bottleneck when objects are extremely dense.
  • Domain shifts: If the video style is very different from pre‑training data, performance can dip until fine‑tuned.

Required Resources:

  • Works best with a large ViT (e.g., ViT‑L) and strong self‑supervised pre‑training (e.g., DINOv2, DINOv3, EVA‑02).
  • Training uses modern GPUs and benefits from transformer optimizations (FlashAttention, compiler graphs). Inference is efficient but still likes a GPU for top FPS.

When NOT to Use:

  • If you must squeeze maximum possible accuracy on tiny models with weak pre‑training, heavy trackers might still edge out.
  • If the application needs very long‑term re‑identification across many minutes or cross‑camera tracking, a dedicated memory or re‑ID module may help.
  • If you need specialized post‑hoc logic (e.g., merging tracks across multiple streams), a pure encoder‑only design may require extra engineering around it.

Open Questions:

  • Can lightweight, longer‑horizon memory (e.g., a tiny cache over tens of frames) help recover from long occlusions without harming speed?
  • Can we learn dynamic query counts, growing or shrinking slots per frame based on scene complexity?
  • What is the best self‑supervised video pre‑training recipe to boost temporal stability even further?
  • How well does the approach generalize to multi‑camera, 360°, or event‑based videos without new modules?
  • Can similar encoder‑only ideas simplify other video tasks (e.g., action detection, pose tracking) as much as they did here?

06 Conclusion & Future Work

3‑Sentence Summary: VidEoMT proves that one big, well‑trained Vision Transformer can segment and track objects in videos by simply passing object queries forward in time and blending them with learned queries. This encoder‑only design removes bulky trackers and extra modules, keeping accuracy competitive while making the system 5×–10Ɨ faster (up to 160 FPS). The result is a cleaner, faster path to real‑time video understanding.

Main Achievement: The paper’s #1 contribution is showing that temporal association can live inside a single ViT encoder using a tiny query propagation + fusion mechanism, eliminating the need for complex, specialized tracking components.

Future Directions: Add lightweight memory for longer occlusions, explore dynamic numbers of queries, and design richer video‑first pre‑training so the model becomes even more robust without extra parts. Extend the encoder‑only principle to related video tasks (e.g., actions, pose) and multi‑camera setups.

Why Remember This: It flips the script—from ā€œadd more modules to handle timeā€ to ā€œtrust a strong ViT and add a pinch of temporal glue.ā€ That simplicity unlocks real‑time performance without giving up accuracy, making advanced video AI more practical for everyday, on‑device uses.

Practical Applications

  • AR eyewear that labels and highlights objects around you in real time without bulky compute.
  • Home robots that can track toys, pets, and spills quickly for safer, smarter navigation.
  • Retail analytics cameras that follow products and shoppers efficiently while preserving speed.
  • Sports broadcasting tools that segment and track players live for instant replays and stats.
  • Video editing software that auto-selects and tracks subjects for fast cutouts and effects.
  • Autonomous drones that identify and follow targets reliably with low-latency onboard processing.
  • Traffic monitoring systems that segment vehicles and pedestrians at city scale in real time.
  • Medical video tools (e.g., endoscopy) that segment and follow anatomical structures smoothly.
  • Wildlife monitoring that tracks animals across frames without heavy, power-hungry modules.
  • Industrial inspection cameras that detect and track defects on moving assembly lines.
#Video Segmentation #Vision Transformer #Encoder-only #Query Propagation #Query Fusion #DINOv2 #Self-supervised Pre-training #Real-time Inference #Video Instance Segmentation #Video Panoptic Segmentation #Video Semantic Segmentation #Average Precision #Frames Per Second #Mask Transformer #EoMT