Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
Key Summary
- This paper builds a giant, automatically made video library called SVG2 that tells who is in a video, what they look like, and how they interact over time.
- SVG2 contains 636,000+ videos with dense masks, object names, attributes, and relationships, which is far bigger and richer than past datasets.
- A new model named TRASER turns raw videos plus object tracks into clean, time-aware scene graphs in one pass.
- TRASER’s key trick is lining up visual tokens with each object’s path (trajectory) and then summarizing them in two smart ways: across the whole object and in short time windows.
- Compared to strong open-source systems, TRASER boosts relation detection by about 15–20% and object prediction by about 30–40%, and even beats GPT-5 on object and attribute prediction.
- The dataset and model are fully automatic to produce at scale, using SAM2 for masks, a special online–offline tracker to find new objects, DAM for object descriptions, and GPT-5 to infer relations.
- When TRASER’s scene graphs are given to a language model for VideoQA, accuracy improves by up to 4.6%, showing that explicit scene graphs help models reason.
- An LLM-based judge checks predictions using meaning (like synonyms) instead of only exact words, making evaluation fairer for open-vocabulary outputs.
- SVG2’s scale and TRASER’s design make video understanding more reliable for long, busy scenes with many moving parts.
Why This Research Matters
Videos power our digital world, from safety cameras to sports highlights to classroom lessons, but computers often miss the who–did–what–when story. SVG2 and TRASER give machines a clean, time-stamped map of each scene, which makes answers more accurate and explanations clearer. This helps search engines find the exact moment you want, assists robots to act safely around people and pets, and creates better summaries for accessibility. Because scene graphs are explicit, they make AI easier to audit and debug, building trust. The approach is scalable and open-friendly, so the community can keep improving video understanding together.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine trying to follow a busy playground video: kids run, a dog chases a ball, a bike rolls by, and a parent films it all. You don’t just see objects—you notice who is near whom, who is holding what, and how those things change over time.
🥬 The Concept (Spatio-Temporal Scene Graphs): A scene graph is like a map of a scene that lists objects (nodes), their attributes (like color or size), and their relationships (edges), and a spatio-temporal scene graph adds time so it records how those relationships change as the video plays. How it works:
- 1) Find and segment every object in each frame. 2) Track each object across frames so it keeps the same ID. 3) Describe the object’s attributes. 4) Detect relationships (like holding, following, in front of) and when they happen. Why it matters: Without this structure, AI treats a video like a messy pile of pixels and easily forgets who did what, when. 🍞 Anchor: In a skatepark video, a scene graph can say person-A (red shirt) rides skateboard from 0–5 s, then carries skateboard from 5–8 s, while another person films from the side the whole time.
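The four steps above can be sketched as a tiny data model. This is a minimal illustration only: `ObjectNode`, `Relation`, and their field names are assumptions for this sketch, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ObjectNode:
    """One tracked object: a stable ID, an open-vocabulary name, and visual attributes."""
    obj_id: int
    name: str
    attributes: list

@dataclass
class Relation:
    """A time-stamped edge: subject --predicate--> object over one or more intervals."""
    subject_id: int
    predicate: str
    object_id: int
    intervals: list  # list of [start_sec, end_sec] spans

# The skatepark example from the text, encoded as nodes and edges
rider = ObjectNode(1, "person", ["red shirt"])
board = ObjectNode(2, "skateboard", [])
rides = Relation(1, "rides", 2, [[0, 5]])
carries = Relation(1, "carries", 2, [[5, 8]])
```

The time intervals on each edge are what make the graph spatio-temporal rather than a single-frame snapshot.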
The world before: Computers were pretty good at spotting objects in single images. But videos are harder: objects move, change size, get occluded, and new objects appear or vanish. Most video datasets had only a few frames labeled per clip, used bounding boxes (which are too coarse for fine details), and rarely marked when relationships started and ended. This made models memorize training quirks instead of truly understanding what happens.
The problem: Building spatio-temporal scene graphs demands three tough ingredients at once: (1) accurate, consistent tracking of every instance across time, even for new arrivals and reappearances; (2) rich, fine-grained attributes (like color, texture, state); and (3) precise, time-localized relationships (who holds what, who passes what, who looks at whom, and for how long). Doing all this by hand is slow and very expensive.
Failed attempts: People tried labeling only a few keyframes and guessing the rest, but this missed objects that popped in later or disappeared early. Some used only bounding boxes, which blur object boundaries (bad for non-rigid things like dogs or flowing water). Others tried detectors to discover new objects during tracking, but identities got mixed up, causing the same object to get multiple IDs.
The gap: We needed a fully automated, scalable pipeline that (a) segments everything precisely (panoptic segmentation), (b) tracks consistently while discovering new objects mid-video, (c) describes each object’s attributes, and (d) infers rich, time-stamped relationships—ideally leveraging strong vision-language models to capture nuance beyond simple 2D geometry.
🍞 Hook (Panoptic Segmentation): You know how some coloring books outline every shape so you can color each part cleanly? That’s what panoptic segmentation does: it outlines every thing and stuff region. 🥬 The Concept: Panoptic segmentation labels every pixel with either a specific object instance (like person #3) or an amorphous stuff region (like sky or grass). How it works: run a powerful segmenter per frame to propose masks for all regions, filter overlaps, and keep high-quality masks. Why it matters: Without clean pixels, fine relationships (like hand-on-handlebar) get lost. 🍞 Anchor: In a biking video, you get separate masks for rider, bike frame, front wheel, back wheel, road, and trees.
Real stakes: Better video understanding matters for video search (“show the clip where the kid hands the ball”), safer robots (don’t bump into the dog behind the box), sports analytics (who passed to whom, when), content creation (auto-highlights), accessibility (text summaries for blind users), and auditing AI (what exactly did the model think happened, and when?).
🍞 Hook (SVG2): Imagine a mega video library where every clip comes with a super-detailed map of who’s where, what they look like, and how they interact—frame after frame. 🥬 The Concept: SVG2 is a massive, automatically built dataset of video scene graphs with panoptic masks, attributes, and time-stamped relations. How it works: It combines strong segmenters, a clever tracker that finds new objects, object describers, and an LLM to infer relations—no humans needed at scale. Why it matters: Without lots of clean, diverse data, models can’t generalize beyond a few lab videos. 🍞 Anchor: SVG2 includes over 636,000 videos and millions of objects, attributes, and relations—like going from a small school library to a national library for video understanding.
🍞 Hook (VLMs): You know a friend who can look at a picture and also explain it in words? That’s what a Vision-Language Model (VLM) is like. 🥬 The Concept: A VLM jointly learns from images/videos and text so it can connect visual patterns to language. How it works: A vision encoder turns pixels into tokens; a language model reasons and speaks. Why it matters: Without the language side, it’s hard to name objects and describe nuanced relations. 🍞 Anchor: A VLM can say “a person in a yellow raincoat holding a blue umbrella while crossing the street.”
02 Core Idea
🍞 Hook (The “Aha!”): Think of turning a long, busy video into a neat comic strip that shows characters (objects), their looks (attributes), and their interactions (relations), with clear time stamps.
🥬 The Concept in one sentence: The paper’s key idea is to (1) auto-build a huge, high-quality video scene graph dataset (SVG2) and (2) design TRASER, a model that lines up visual tokens with each object’s path and then smartly compresses them so it can predict complete spatio-temporal scene graphs in a single pass.
Three analogies:
- Parade organizer: First, the pipeline neatly lines up every marcher (object) in order, gives them name tags (attributes), and notes who marches beside or hands things to whom (relations). TRASER is the announcer who reads it all out clearly.
- Movie editor: It cuts a video into object-centric reels and also into short time clips, then stitches both views so the story (who did what, when) is easy to narrate.
- Classroom note-taker: It highlights important details for each student (object-long summary) and also writes quick, time-stamped notes during class (short time windows) so nothing is forgotten.
Before vs After:
- Before: Models got confused by long videos and many objects, struggled with new objects appearing mid-clip, and missed subtle relationship changes.
- After: With SVG2’s scale and TRASER’s trajectory-aware design, the model keeps object identities straight, captures both slow trends (global context) and quick changes (local motion), and outputs a structured scene graph reliably.
Why it works (intuition, not equations):
- Aligning visual tokens to object trajectories removes a big source of confusion: the model knows which pixels “belong together” over time.
- Two complementary summaries—across the whole trajectory and within short time windows—let the model remember the big picture while still catching quick actions.
- Training on a gigantic, diverse dataset teaches the model many realistic patterns, from simple spatial layouts to complex events.
Building blocks (with sandwiches):
- 🍞 Hook (Trajectory-Aligned Token Arrangement): You know how it’s easier to follow a story if you keep each character’s notes in their own folder? 🥬 The Concept: Trajectory-aligned token arrangement groups visual tokens by which object they cover across time, creating a clean, identity-preserving stream per object. How it works: use segmentation masks to assign tokens to objects, sort them by time, and separate streams with special markers. Why it matters: Without this, the model mixes up who is who when scenes get crowded. 🍞 Anchor: All tokens that cover the skateboarder are placed together in time order, so the model cleanly tracks that one person.
- 🍞 Hook (Object-Trajectory Resampler): Imagine summarizing an entire season for one player: strengths, style, and usual teammates. 🥬 The Concept: The object-trajectory resampler condenses all tokens for one object’s whole path into a compact summary that captures stable identity and overall context. How it works: it learns a small set of queries to pull out the most informative bits from the long token stream. Why it matters: Without a global summary, the model forgets who the object is across long timelines. 🍞 Anchor: It keeps the idea “this is the red-shirt rider with a longboard” consistent throughout the clip.
- 🍞 Hook (Temporal-Window Resampler): Think of replay highlights: you zoom in on a few seconds to see the exact move. 🥬 The Concept: The temporal-window resampler creates short, time-local summaries for each object to preserve quick changes in motion or relations. How it works: split the trajectory into windows (like 4-second chunks) and summarize each separately. Why it matters: Without local zoom-ins, the model misses fast actions, like a brief handoff. 🍞 Anchor: It spots the exact 2–3 seconds when a player passes a ball.
- 🍞 Hook (Online–Offline Tracking): Sometimes a new friend joins mid-game—you need to add them to the roster and still keep everyone’s names right. 🥬 The Concept: The online–offline tracker both discovers new objects as they appear and replays the whole video to keep identities stable. How it works: monitor uncovered regions to add newcomers, then redo tracking from each object’s true start. Why it matters: Without this, new objects get missed or old ones get mixed up. 🍞 Anchor: A bicyclist entering at frame 50 gets a proper ID and a full, clean trajectory.
- 🍞 Hook (VLM + GPT-5 reasoning): You know how a clever friend can look at frames and say, “He’s handing her the book now”? 🥬 The Concept: A VLM generates rich descriptions; GPT-5 infers nuanced, time-anchored relations from object IDs, boxes, and sampled frames. How it works: run attribute parsing per object, then query GPT-5 twice—once for deeper 3D spatial relations and once for non-spatial ones like functional, social, motion, and events. Why it matters: Without language-grounded reasoning, many subtle relations remain unlabeled. 🍞 Anchor: It distinguishes “riding” vs “carrying” vs “pushing” the same skateboard at different times.
03 Methodology
At a high level: Video → Panoptic masks per frame → Online–offline tracking to get stable trajectories → Per-trajectory descriptions and attribute parsing → GPT-5 relation inference → SVG2 scene graphs → Train TRASER → One-pass prediction of objects, attributes, and relations from new videos.
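The flow above can be sketched as a chain of functions. Every helper here is a trivial stub standing in for the real component named in the text (SAM2, the online–offline tracker, DAM, GPT-5), so only the structure of the pipeline, not its behavior, is meaningful.

```python
def segment_frames(frames):
    """Stub for SAM2's per-frame panoptic mask proposals."""
    return [{"frame": i, "masks": [0]} for i, _ in enumerate(frames)]

def track(per_frame_masks):
    """Stub for the online-offline tracker: returns per-object trajectories."""
    return [{"obj_id": 0, "frames": [m["frame"] for m in per_frame_masks]}]

def describe(trajectory):
    """Stub for DAM description + attribute parsing."""
    return {"obj_id": trajectory["obj_id"], "name": "object", "attributes": []}

def infer_relations(objects, frames):
    """Stub for GPT-5 time-stamped relation inference."""
    return []

def build_scene_graph(video_frames):
    """Chain the three phases of the pipeline into one scene graph."""
    masks = segment_frames(video_frames)            # Phase 1a: panoptic masks
    trajectories = track(masks)                     # Phase 1b: stable trajectories
    objects = [describe(t) for t in trajectories]   # Phase 2: names + attributes
    relations = infer_relations(objects, video_frames)  # Phase 3: relations
    return {"objects": objects, "relations": relations}

graph = build_scene_graph([None, None, None])
```

The real system swaps each stub for a heavy model, but the dataflow (masks → trajectories → descriptions → relations) is exactly the one described above.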
Phase 1: Panoptic trajectory generation
- What happens: A strong segmenter (SAM2) proposes masks on each frame using multi-scale grid prompts to cover both big and small regions. We filter overlapping proposals to reduce redundancy. Then we run an online–offline tracker: the online part propagates masks and actively looks for uncovered regions (new objects); the offline part replays from each object’s real first appearance to stabilize identities. Lightweight post-filtering removes duplicates and smooths masks.
- Why this step exists: We need clean, per-object, per-frame masks with consistent IDs—otherwise, later steps can’t attach attributes or relations correctly.
- Example: In a park video, SAM2 finds masks for a dog, a frisbee, and two people. At frame 30, a bike enters; the online tracker notices a previously uncovered area and assigns a new ID to the bike. The offline pass restarts each track at its true entry time, ensuring identity stability across the whole clip.
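The online tracker's new-object discovery can be illustrated with a simple uncovered-area check: union the masks of all tracked objects and flag the frame if too much of it is left uncovered. The function name and area threshold are illustrative assumptions, not the paper's actual criterion.

```python
import numpy as np

def has_new_object(frame_shape, tracked_masks, min_area=50):
    """Flag a frame whose uncovered region is large enough to warrant
    prompting the segmenter for a newcomer (threshold is illustrative)."""
    covered = np.zeros(frame_shape, dtype=bool)
    for mask in tracked_masks:
        covered |= mask  # union of all currently tracked masks
    return int((~covered).sum()) >= min_area

# A bike entering on the right half of a 10x20 frame: left half already tracked
shape = (10, 20)
left_half = np.zeros(shape, dtype=bool)
left_half[:, :10] = True
has_new_object(shape, [left_half])  # True: 100 uncovered pixels >= 50
```

When the check fires, the real pipeline would segment the uncovered region, assign a new ID, and later replay tracking from that object's true first appearance.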
Phase 2: Object description and structured parsing
- What happens: For each trajectory, we pick the 8 frames where the object is most visible and send them (with masks) to DAM-3B-Video to describe the object in detail. A small LLM (GPT-4o-nano) parses each description into an object name and a list of purely visual attributes (e.g., red, glossy, striped). A verifier step checks whether the label is supported by a second segmenter before keeping it.
- Why this step exists: Attributes and names enrich the graph and help with open-vocabulary evaluation (e.g., human vs person).
- Example: “A small, yellow frisbee with a matte finish” becomes object = frisbee, attributes = [yellow, small, matte].
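A toy stand-in for the parsing step: where the paper uses a small LLM, this sketch uses plain vocabulary lookup, so the function and word lists are purely illustrative.

```python
def parse_description(description, known_objects, known_attributes):
    """Toy LLM-parser stand-in: pull an object name and visual attributes
    out of a free-text description via simple vocabulary matching."""
    words = description.lower().replace(",", " ").split()
    name = next((w for w in words if w in known_objects), None)
    attrs = [w for w in words if w in known_attributes]
    return {"object": name, "attributes": attrs}

parsed = parse_description(
    "A small, yellow frisbee with a matte finish",
    known_objects={"frisbee", "dog", "person"},
    known_attributes={"small", "yellow", "matte"},
)
# {'object': 'frisbee', 'attributes': ['small', 'yellow', 'matte']}
```

The real parser is open-vocabulary and needs no word lists; the point is only the shape of the output: one name plus purely visual attributes.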
Phase 3: Inter-object spatio-temporal relation extraction
- What happens: We sample frames (e.g., 1 fps) and provide GPT-5 with object IDs, labels, and bounding boxes. We split relation inference into two passes. For spatial relations, we ask for depth-aware, 3D-consistent relations (not trivial left/right). For non-spatial, we cover functional/manipulation, stateful/attachment, motion, social, attentional (including camera), and event-level relations. Each relation is returned with one or more time intervals.
- Why this step exists: Relationships drive the story of the video: who interacts with what, and when. Splitting the passes ensures non-spatial relations aren’t drowned out by spatial ones.
- Example: person-1 riding bike-3 from frames 10–70; then person-1 carrying bike-3 from 71–90; person-2 filming person-1 from 0–90.
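The time-stamped relations from this example can be stored as plain records and queried per frame; the schema below is illustrative, not the paper's exact output format.

```python
relations = [
    {"subject": "person-1", "predicate": "riding",   "object": "bike-3",   "intervals": [[10, 70]]},
    {"subject": "person-1", "predicate": "carrying", "object": "bike-3",   "intervals": [[71, 90]]},
    {"subject": "person-2", "predicate": "filming",  "object": "person-1", "intervals": [[0, 90]]},
]

def active_relations(relations, frame):
    """Return the predicates that hold at a given frame index."""
    return [r["predicate"] for r in relations
            if any(start <= frame <= end for start, end in r["intervals"])]

active_relations(relations, 50)  # ['riding', 'filming']
```

Because each edge carries its own intervals, the same subject-object pair can switch predicates over time ("riding" then "carrying") without losing either fact.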
SVG2 dataset construction
- What happens: We run this 3-phase pipeline on 636K+ videos (from SA-V and PVD) to get 6.6M objects, 52M attributes, and 6.7M relations. A special 100-video human-annotated test set adds even denser, hierarchical masks and open-vocabulary labels. Spot checks show high accuracy for objects, attributes, and relations.
- Why this step exists: Models need a huge, diverse, and consistent dataset to learn robust video scene graphs.
- Example: A single sports clip might yield dozens of objects, hundreds of attributes, and many relation intervals (passes, tackles, shots, celebrations).
TRASER architecture (one-pass VSG generation)
- Input → trajectory-aligned token arrangement → dual resampler → LLM decoder → scene graph JSON.
Step A: Trajectory-aligned token arrangement
- What happens: Using the masks, we assign vision tokens to the object they cover, sort by time, and separate objects with special markers so each object has a clean token stream.
- Why this step exists: Without identity-aligned tokens, the model wastes attention figuring out which pixels belong to which object.
- Example: All tokens covering the dog’s pixels across frames are grouped into a single stream marked “Object 7”.
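A minimal sketch of the arrangement, assuming tokens have already been assigned to objects via the masks; the separator-marker string is made up for illustration.

```python
def arrange_tokens(token_records, sep="<obj>"):
    """token_records: list of (object_id, frame_idx, token).
    Group tokens per object, sort each group by time, and prefix every
    object's stream with a separator marker (marker name is illustrative)."""
    by_obj = {}
    for obj_id, frame_idx, tok in token_records:
        by_obj.setdefault(obj_id, []).append((frame_idx, tok))
    stream = []
    for obj_id in sorted(by_obj):
        stream.append(f"{sep}{obj_id}")                     # identity marker
        stream.extend(tok for _, tok in sorted(by_obj[obj_id]))  # time order
    return stream

arrange_tokens([(7, 1, "dog_t1"), (3, 0, "ball_t0"), (7, 0, "dog_t0")])
# ['<obj>3', 'ball_t0', '<obj>7', 'dog_t0', 'dog_t1']
```

The payoff is that everything between two markers belongs to one identity, so downstream attention never has to re-derive which pixels go together.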
Step B: Object-trajectory resampler (global summary)
- What happens: A small set of learnable queries extracts a compact, high-level summary of each object’s entire trajectory.
- Why this step exists: It preserves who the object is over long time spans (e.g., “red-shirt person”) and gives the model stable identity context.
- Example: It remembers the skateboarder’s outfit and that they are near the ramps for much of the clip.
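A minimal, Perceiver-style sketch of what such a resampler does: a small set of query vectors cross-attends over the long token stream and returns a fixed-size summary. The queries are random here; in the actual model they are learned, and the real module is richer than this single attention step.

```python
import numpy as np

def resample(tokens, queries):
    """Compress a long token stream (T, d) to (Q, d) summary vectors via
    scaled softmax cross-attention -- a single-step resampler sketch."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])      # (Q, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)               # rows sum to 1
    return weights @ tokens                                     # (Q, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(500, 16))   # one object's whole trajectory: 500 tokens
queries = rng.normal(size=(8, 16))    # small query set (random stand-in for learned)
summary = resample(tokens, queries)   # (8, 16): compact global summary
```

However long the trajectory grows, the summary stays 8 vectors, which is what keeps long videos tractable for the decoder.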
Step C: Temporal-window resampler (local details)
- What happens: We split each object’s token stream into short windows (e.g., 4 seconds) and summarize each chunk separately to capture quick motions and relation changes.
- Why this step exists: Fine-grained timing matters: when exactly did passing, holding, or looking happen?
- Example: Window 3 (seconds 8–12) captures the exact moment the frisbee is thrown.
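Windowing can be sketched by chunking per-frame features and pooling each chunk; mean-pooling here stands in for the learned per-window resampler, and the 4-second default mirrors the example in the text.

```python
import numpy as np

def temporal_window_summaries(frame_features, fps=1, window_sec=4):
    """Split an object's per-frame features (T, d) into window_sec-long chunks
    and summarize each chunk (mean-pooling stands in for a learned resampler)."""
    window = max(1, fps * window_sec)
    return [frame_features[i:i + window].mean(axis=0)
            for i in range(0, len(frame_features), window)]

feats = np.arange(20, dtype=float).reshape(10, 2)  # 10 frames, 2-dim features
chunks = temporal_window_summaries(feats)          # windows: frames 0-3, 4-7, 8-9
```

Each chunk summary is local in time, so a brief action in window 3 is not averaged away by everything that happens before or after it.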
Step D: Structured decoding
- What happens: The language model receives a compact, object-aware, time-aware token set and is prompted to output a structured scene graph (objects, attributes, time-stamped relations) directly.
- Why this step exists: Free-form text can be vague; a structured JSON graph is precise and machine-usable.
- Example: It emits entries like [subject, relation, object, [[start, end], ...]].
The secret sauce
- Identity first, then compress: By aligning tokens to object identities before summarizing, TRASER avoids mixing signals across objects.
- Two summaries, two time scales: The global summary keeps who-is-who steady; the windowed summary catches fast actions.
- Scale meets structure: Training on SVG2’s breadth teaches the model both common and rare patterns, from static spatial layout to dynamic, multi-step events.
04 Experiments & Results
The test: Can models, given ground-truth object trajectories, correctly predict objects, attributes, relations, and full triplets (subject–relation–object with the right time spans) across several benchmarks (PVSG, VidOR, VIPSeg, and the rich SVG2 test set)? The team also checks whether TRASER’s scene graphs help a separate language model answer video questions better.
The competition: TRASER is compared to strong proprietary models (like GPT-4.1, Gemini-2.5 Pro, and GPT-5) and leading open-source VLMs (Qwen-family, InternVL, GLM, MiniCPM). To keep things fair, all models receive the same object trajectories and bounding boxes and must output JSON scene graphs. There’s also a control where a baseline open-source model is fine-tuned under similar settings.
Making scores meaningful: Exact-word scoring is unfair in open-vocabulary settings (e.g., human vs person). So an LLM judge compares meanings: identical, synonym, hypernym/hyponym, or semantic overlap. It also checks that the predicted relation’s time window overlaps the ground truth enough. Agreement with human annotators is high, so this is a reliable, practical metric.
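The temporal part of this check can be expressed as interval IoU between predicted and ground-truth time windows. The exact overlap criterion the paper's judge uses is not given here, so this function and any threshold applied to it are illustrative.

```python
def interval_iou(pred, gt):
    """Temporal intersection-over-union between two [start, end] intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

interval_iou([10, 70], [20, 80])  # 50 / 70 ≈ 0.714
```

A predicted relation would then count as correct only when both the semantic match (judged by the LLM) and the temporal overlap clear their respective bars.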
Scoreboard (with context):
- Object prediction: TRASER is like consistently getting an A when others get B’s, boosting accuracy by about 30–40% over strong open-source baselines and even surpassing GPT-5 on object naming. This shows the identity-aware design works.
- Relation detection: TRASER improves by about 15–20% over top open-source baselines. Think of it as correctly spotting many more “who did what to whom, when” connections others missed.
- Attributes: TRASER raises attribute prediction by about 15% over open-source state of the art and tops GPT-5 on this too, showing that per-trajectory attribute parsing and training data scale help fine-grained description.
- Triplets: End-to-end subject–relation–object with timing also improves substantially (roughly +15% over open-source models), reflecting stronger, more coherent graphs.
Surprising findings:
- Bigger is not always better alone: Even very capable proprietary VLMs trail TRASER on some object/attribute tasks. When structure (trajectory alignment and resampling) matches the problem, a targeted model can win.
- Data scale matters a lot: Using the full SVG2 (including PVD) lifts performance across tasks, especially attributes, confirming that massive, diverse, well-structured training data is key.
- Token alignment beats simple fine-tuning: Fine-tuning a baseline with the same data helps, but still lags behind TRASER. The architectural choices (trajectory-aligned tokens + dual resampling) provide unique gains.
VideoQA boost: If you give a language model (like GPT-4.1) both the video frames and TRASER’s scene graphs, it answers questions better than with video alone—up to a 4.6% absolute gain on a diagnostic benchmark. Even using only the scene graph text (no video), the system answers a notable chunk of questions correctly, showing the graphs capture genuinely informative structure.
Generalization and robustness:
- Long videos: TRASER wasn’t trained on very long clips, yet it still holds up reasonably well on longer datasets. Object accuracy stays healthy; it’s the timing of complex relations that gets harder—suggesting future training on longer sequences would help.
- End-to-end (no ground-truth masks): Using automatically produced trajectories lowers scores (as expected), but TRASER remains competitive with strong API models that do have ground-truth inputs. This shows the pipeline’s tracking quality is good enough for meaningful downstream reasoning.
Takeaway: The combination of massive, well-structured training data (SVG2) and an object- and time-aware model (TRASER) significantly advances open-vocabulary video scene graph generation and makes downstream video reasoning measurably better.
05 Discussion & Limitations
Limitations:
- Synthetic roots: SVG2 is created by automated models (segmenters and LLMs), so it inherits their biases and occasional mistakes. While human spot checks show high accuracy, there will be noise, especially in edge cases (tiny objects, heavy occlusion, tricky social cues).
- Long-horizon complexity: Very long videos with many interacting objects still stretch the temporal-window resampler and the decoder; fine-grained timing can drift.
- Open-world corner cases: Some rare relations or unusual objects may be underrepresented, even with SVG2’s size.
Required resources:
- Training TRASER at scale benefits from modern GPUs and a sizable storage budget for SVG2. Inference is efficient compared to full video-token floods because of the resamplers, but per-video segmentation/tracking still takes compute if you run end-to-end.
When not to use:
- If your task needs precise 3D measurements (like exact distances or metric motion), a 2D video scene graph may be too approximate.
- If your videos lack clear objects (e.g., medical scans without distinct anatomical segmentation) or require specialized domain labels, you may need domain-specific training.
- If exact pixel-perfect masks are mandatory downstream, minor mask artifacts after tracking could be a blocker.
Open questions:
- Longer videos and hierarchical time: How can we scale temporal windows or add multi-scale temporal pyramids to better capture minute-level and hour-level stories?
- Interactive relations and causality: Can we more explicitly disentangle cause–effect chains and multi-agent coordination (e.g., pass → shot → score)?
- Multimodal cues: How can audio (speech, crowd noise) and text (on-screen signs) be integrated to refine relations and events?
- Reliability and safety: Can we quantify and reduce label noise automatically, and provide uncertainty estimates so downstream systems know when to double-check?
- Real-to-sim synergy: How can improvements to models feed back to make the next SVG version even cleaner, closing the loop faster?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces SVG2, a giant, fully automatic dataset of video scene graphs with precise object masks, attributes, and time-stamped relations. On top of it, the authors build TRASER, a model that aligns tokens to each object’s path and summarizes them globally and in short windows, producing accurate spatio-temporal scene graphs in one pass. Together, they deliver large gains over strong baselines and even help separate language models answer video questions more accurately.
Main achievement: Showing that scale plus structure wins—SVG2’s breadth and TRASER’s object- and time-aware design together make open-vocabulary video scene graph generation markedly more reliable and useful.
Future directions: Train with longer videos and multi-scale temporal modules to boost long-horizon relation timing; fuse audio and text overlays to enrich event-level understanding; tighten the end-to-end loop so better models further refine the dataset, creating a virtuous cycle; extend hierarchical annotations to reason about parts and wholes more explicitly.
Why remember this: It marks a turning point where video understanding moves from fuzzy impressions to crisp, checkable maps of who, what, where, and when—at scale. That clarity not only improves accuracy today but also builds trust, debuggability, and compositional reasoning that future multimodal systems can build upon.
Practical Applications
- Smart video search: Jump directly to the moment when a person hands a package to another.
- Sports analytics: Automatically detect passes, shots, and defensive interactions with timestamps.
- Home robots: Navigate safely by understanding who is moving where and what they’re holding.
- Security review: Flag unusual interactions (e.g., entering restricted areas) with clear object IDs and times.
- Education: Auto-generate step-by-step lab procedure summaries from classroom experiment videos.
- Content creation: Build highlight reels by extracting event-level relations (e.g., 'scores a goal').
- Accessibility: Provide text scene-graph summaries for blind or low-vision users.
- Retail analytics: Understand shopper–product interactions (picking up, returning, comparing).
- QA assistants: Feed scene graphs into language models to answer complex video questions more reliably.