
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Intermediate
Linli Yao, Yuancheng Wei, Yaojie Zhang et al. (2/9/2026)
arXiv

Key Summary

  • This paper teaches AI to write movie-like scripts for videos by adding exact timestamps and rich details about what you see and hear.
  • It introduces a new task called Omni Dense Captioning that splits a video into scenes and describes each one across six clear parts.
  • A new benchmark, OmniDCBench, provides high-quality human-written examples to check if models really understand videos over time.
  • A new score, SodaM, fairly measures both when a scene happens (time) and how complete the description is (details).
  • The model TimeChat-Captioner-7B learns in two steps: first copy good examples (SFT), then improve with rewards (GRPO) that care about time and detail.
  • TimeChat-Captioner beats strong systems (even Gemini-2.5-Pro) on the main metric for this task.
  • The captions it writes also help other tasks like answering questions about videos and finding exact moments (temporal grounding).
  • Key tricks include mixing audio and video tokens in time order and using special position clues so the model knows 'when' things happen.
  • Limits include handling very long videos and fitting everything into a 32K token window; future work plans to compress tokens and train on longer content.
  • All data, models, and code will be released to help the community build better video understanding systems.

Why This Research Matters

Videos are everywhere—classes, news, sports, movies—and we need AI that can explain them clearly and in order. Script-like, time-aware captions make videos searchable by exact moments, so you can jump to the part you need. They also help people with visual or hearing impairments experience the story through detailed text. In education, the six-part structure teaches the language of film—camera moves, editing, dialogue, and sound—making media literacy stronger. For safety and analysis (sports, surveillance, audits), precise timing plus detailed descriptions reduces confusion and speeds up investigations. And for creators, such scripts can guide or control video generation and editing tools, unlocking smarter creative workflows.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how you can retell a movie to a friend by saying what happened first, what came next, who talked, what the music felt like, and where the camera looked? That’s how our brains make videos into stories.

🥬 Filling: The Concept — Audio-Visual Captioning

  • What it is: Teaching AI to write sentences that describe what’s happening in a video, using both what it sees (visual) and what it hears (audio).
  • How it works:
    1. The model watches frames (pictures over time) and listens to the audio.
    2. It finds people, objects, actions, and sounds.
    3. It writes a description in text.
  • Why it matters: Without captions, videos are just pixels and waves; AI can’t explain them or answer questions about them. 🍞 Anchor: Imagine a clip of a dog chasing a ball while kids cheer. A good caption says, “A brown dog runs after a red ball while children clap and laugh.”

🍞 Hook: Think about making a timeline of your school day—math at 9:00, lunch at 12:00. The times help you tell the story in order.

🥬 Filling: The Concept — Temporal Grounding

  • What it is: Linking parts of the description to the exact times they happen in the video.
  • How it works:
    1. Split the video into intervals.
    2. Mark each interval with start and end times.
    3. Attach the right sentence to the right interval.
  • Why it matters: Without timing, AI mixes moments up, like saying the bell rang before class started. 🍞 Anchor: “00:10–00:18: The dog catches the ball; the crowd cheers louder.”

🍞 Hook: Imagine arranging a play: one scene in the classroom, the next on the playground.

🥬 Filling: The Concept — Scene

  • What it is: A chunk of video where the time, place, or story stays consistent.
  • How it works:
    1. Notice changes in setting, action, or dialogue.
    2. Group shots that tell one mini-story.
    3. Put a clear boundary before the next mini-story.
  • Why it matters: Without scenes, descriptions get jumbled and confusing. 🍞 Anchor: A car conversation inside the vehicle is one scene; cutting outside to a driveway with trees is another.

🍞 Hook: Picture reading a paragraph that says everything happened “sometime later.” You’d be lost!

🥬 Filling: The Concept — Timestamps

  • What it is: Labels like “00:34–00:41” that pin events to exact times.
  • How it works:
    1. Detect when a scene starts and ends.
    2. Write those times in minutes:seconds.
    3. Use them to order descriptions.
  • Why it matters: Without timestamps, the AI can’t track order or duration. 🍞 Anchor: “00:34–00:41: Overhead shot of a white car circling; driver complains in an impatient tone.”
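The minutes:seconds labels above are easy to compute. A minimal sketch (function names are my own, not from the paper):

```python
def to_timestamp(seconds: float) -> str:
    """Format a time in seconds as an MM:SS label."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def scene_label(start: float, end: float) -> str:
    """Pin a scene to an interval like '00:34-00:41'."""
    return f"{to_timestamp(start)}-{to_timestamp(end)}"

print(scene_label(34, 41))   # 00:34-00:41
print(to_timestamp(125))     # 02:05
```

With labels like these attached to each scene, the descriptions can be ordered and searched by exact moment.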

The World Before: Most captioning systems wrote one big paragraph per video with no timestamps. That made them miss the order of events and left out key audio clues—like tone shifts, music changes, or engine sounds. Dense video captioning tried to add time, but usually stayed visual-focused and short—catching only “big” events, not the steady stream of little details that make a scene feel real.

The Problem: We needed a way to write script-like captions that are both time-aware and detail-rich across audio and visuals—so reading them feels like watching the video in your head.

Failed Attempts:

  • Global captions (no times) make the model fuzzy about “when.”
  • Sparse event captions miss small but important details (like a quick glance or a sudden music change).
  • Some methods added audio but still summarized too briefly, skipping camera moves or editing style.

The Gap: No open-source system, dataset, or metric tightly combined: continuous timestamps + detailed six-part descriptions + fair evaluation of both time and content.

Real Stakes:

  • Accessibility: People who are blind or hard of hearing can “see” and “hear” the story through text.
  • Education: Students learn film language (camera, editing) and narrative structure.
  • Search: “Find the exact moment when the driver admits he borrowed the car” becomes easy.
  • Safety and sports: Knowing precisely when and what happened can matter a lot.

02 Core Idea

🍞 Hook: Imagine a movie script that not only tells you what happens but also exactly when, what the camera does, who says what, and how the sound feels.

🥬 Filling: The Concept — Omni Dense Captioning

  • What it is: A new task that splits a video into continuous scenes and writes time-stamped, richly detailed captions across six parts: Events, Background, Camera, Shot Editing, Dialogue, and Acoustic cues.
  • How it works:
    1. Segment the video into scenes with timestamps.
    2. For each scene, fill in the six-part description.
    3. Repeat until the whole video becomes a script-like story.
  • Why it matters: Without this structure, models miss timing or skip key details, making their “stories” incomplete. 🍞 Anchor: “00:31–00:43: Inside the car, the driver speaks in an impatient tone (Acoustics); the camera cuts to a close-up of the backseat passenger (Camera); dialogue lines are written with who said what (Dialogue).”

Three Analogies:

  1. Museum tour: The guide (the model) leads you room by room (scenes), noting artwork (events), room decor (background), where you look (camera), how rooms connect (editing), what people say (dialogue), and the room’s ambiance (acoustics).
  2. Comic book: Each panel (scene) has drawings (visuals), speech bubbles (dialogue), motion lines and zooms (camera), transitions between panels (editing), and sound words like “vroom” (acoustics), all in order.
  3. Cooking show script: Timestamps mark steps, camera angles show close-ups of chopping, background shows the kitchen, dialogue is the chef’s instructions, and sizzling sounds are noted.

Before vs After:

  • Before: One paragraph per video, fuzzy timing, weak audio, and missed camera/editing language.
  • After: A scene-by-scene screenplay with exact timing and six-part coverage that captures the full feel of the video.

Why It Works (intuition, no equations):

  • Structure reduces confusion: six labeled boxes guide the model to “look” for all key aspects.
  • Time anchors lock events into order, preventing mix-ups.
  • Audio-visual alignment improves when tokens are processed in time order, like listening and watching together.
  • Training with rewards that care about both time and content nudges the model to balance accuracy and detail.

🍞 Hook: You know how tests need clear answer keys to be fair?

🥬 Filling: The Concept — OmniDCBench (Benchmark)

  • What it is: A carefully human-annotated set of videos with scene timestamps and six-part captions.
  • How it works:
    1. Experts segment videos and write detailed six-part descriptions.
    2. Double-check timing and content quality.
    3. Use these as the gold standard to test models.
  • Why it matters: Without a reliable benchmark, we can’t tell if models truly improved. 🍞 Anchor: If the model says “overhead shot of a circling car at 00:34–00:41,” we compare against the human-written ground truth for both timing and details.

🍞 Hook: Grading an essay about a timeline needs to check both facts and dates.

🥬 Filling: The Concept — SodaM Metric

  • What it is: A new scoring method that judges timestamp accuracy and description completeness together.
  • How it works:
    1. Align predicted scenes to reference scenes using time overlap (dynamic programming helps match them fairly).
    2. Score the time overlap.
    3. Check if key details in each of the six parts appear in the model’s text.
    4. Combine into a final score.
  • Why it matters: Without joint scoring, a model could get good at time but bad at details—or vice versa. 🍞 Anchor: If your caption is at the right time and includes “overhead shot,” “engine hum,” and “impatient tone,” your SodaM score rises.

🍞 Hook: Imagine a student who first copies good notes, then practices with a coach who gives point-by-point rewards.

🥬 Filling: The Concept — TimeChat-Captioner Model

  • What it is: A video-language model trained to write time-aware, six-part captions scene by scene.
  • How it works:
    1. Step 1 (SFT): Learn the format from many examples.
    2. Step 2 (GRPO): Improve using rewards for format, length, timestamp accuracy, and time-aware detail.
    3. Special token design: Mix audio and video tokens in time order and add temporal position clues.
  • Why it matters: This combo teaches both “what to write” and “when it happens,” without forgetting audio or editing style. 🍞 Anchor: After training, the model can script a car scene with exact times, who spoke, camera moves, and background music tone.

Building Blocks:

  • Six-part schema (Events, Background, Camera, Shot Editing, Dialogue, Acoustic)
  • Time-aware token mixing (interleaving)
  • Temporal position encoding (so the model knows order)
  • Two-stage training (SFT then GRPO with multi-part rewards)
  • SodaM evaluation (measures the right things)

03 Methodology

At a high level: Video with audio → Segment into scenes with timestamps → For each scene, write six-part captions → Output a script-like list of scenes.

Step-by-step recipe:

  1. Input preparation:
  • Sample video frames at 2 FPS and extract audio features.
  • Interleave audio and visual tokens in time order so the model “watches and listens” together.
  • Use temporal position clues so the model knows the order of moments. Why this matters: If audio and video are separate or time is unclear, the model confuses who spoke when or which sound matches which action. Example: The impatient tone must align with the driver’s complaint while the car circles; time-matched tokens make that link obvious.

🍞 Hook: Like shuffling a deck so red and black cards alternate, keeping a rhythm. 🥬 Filling: The Concept — Interleaved Audio-Visual Tokens

  • What it is: A way to mix audio and video tokens so they appear in the same timeline.
  • How it works: Tokenize frames and audio chunks, then arrange them as AVAVAV… along time.
  • Why it matters: Without interleaving, the model might hear a beep but not know which frame it belongs to. 🍞 Anchor: When the engine hum increases right as the car speeds up, interleaving helps the model notice that match.
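The AVAV… ordering can be sketched as a timestamp-sorted merge of the two token streams. The `Token` class and the timings below are illustrative, not the paper's actual tokenizer:

```python
from dataclasses import dataclass

@dataclass
class Token:
    modality: str   # "A" for audio, "V" for video
    time: float     # timestamp in seconds

def interleave(video_tokens, audio_tokens):
    """Merge the two streams by timestamp so tokens appear in time order."""
    return sorted(video_tokens + audio_tokens, key=lambda t: t.time)

video = [Token("V", t) for t in (0.0, 0.5, 1.0)]   # frames sampled at 2 FPS
audio = [Token("A", t) for t in (0.25, 0.75)]      # audio chunks between frames
print("".join(t.modality for t in interleave(video, audio)))  # VAVAV
```

Because a sound token now sits right next to the frame it co-occurs with, the model can associate, say, a rising engine hum with the frame where the car speeds up.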

🍞 Hook: You know how a comic book has panels in order so you don’t read the ending first? 🥬 Filling: The Concept — Multimodal Rotary Position Embedding (M-RoPE)

  • What it is: A method to give each token a sense of “when” it is in the sequence.
  • How it works: Add position signals to tokens so the model tracks time as it reads.
  • Why it matters: Without position clues, scenes blur together. 🍞 Anchor: The camera pan at 00:34 comes before the close-up at 00:40; M-RoPE helps the model keep that order.
  2. Scene segmentation and formatting (SFT stage):
  • Train the model with supervised fine-tuning to copy the correct output format: [timestamp] + six-part scene descriptions.
  • The model learns to detect scene changes and fill each part. Why this matters: If the model doesn’t learn the format, outputs become messy and unparseable. Example: “00:34–00:41 | Events: car circles | Background: manor driveway & trees | Camera: overhead to close-up | ShotEdit: cut between outside and inside | Dialogue: driver asks to stop showing off | Acoustics: light, cheerful music + impatient tone.”
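A script line in the pipe-separated layout shown above could be parsed back into named fields like this. The exact delimiter format is an illustration based on the example, not the paper's specification:

```python
def parse_scene(line: str) -> dict:
    """Split one 'timestamp | Part: text | ...' script line into labeled fields."""
    fields = [f.strip() for f in line.split("|")]
    scene = {"timestamp": fields[0]}
    for field in fields[1:]:
        name, _, text = field.partition(":")
        scene[name.strip()] = text.strip()
    return scene

line = "00:34-00:41 | Events: car circles | Dialogue: driver asks to stop showing off"
print(parse_scene(line)["Events"])  # car circles
```

If the model drifts from this layout, a parser like the above fails, which is exactly what the format reward in the next stage penalizes.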

🍞 Hook: Practicing with an answer key helps you learn the layout of a math test. 🥬 Filling: The Concept — Supervised Fine-Tuning (SFT)

  • What it is: Teaching the model using inputs paired with correct outputs.
  • How it works: Show a video and its gold script; train the model to predict the next correct token.
  • Why it matters: Without SFT, the model won’t reliably produce structured, six-part captions. 🍞 Anchor: After SFT, the model knows to always include Dialogue and Acoustic sections, not forget them.
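At its core, SFT minimizes the negative log-likelihood of the gold script, token by token. A toy sketch, where the log-probabilities are hypothetical inputs rather than values from a real model:

```python
def sft_loss(logprobs_of_gold_tokens: list[float]) -> float:
    """Negative mean log-likelihood of the reference script's tokens.
    Lower is better: the model assigns the gold tokens higher probability."""
    return -sum(logprobs_of_gold_tokens) / len(logprobs_of_gold_tokens)

# Two gold tokens the model assigned log-probs of -0.5 and -1.5:
print(sft_loss([-0.5, -1.5]))  # 1.0
```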
  3. Reward-guided improvement (GRPO stage):
  • The model samples multiple candidate scripts for a video.
  • Each candidate gets rewards:
    • Format reward: Is it valid and parseable?
    • Length reward: Is it long enough to be complete but not rambling?
    • Timestamp reward: Are scene times accurate?
    • Time-aware caption reward (SodaM): Do the six parts cover key details at the right times?
  • The model updates itself to prefer candidates with higher combined rewards. Why this matters: SFT learns format, but GRPO sharpens timing and detail quality. Example: If a candidate misses the “overhead shot” or gets times off by several seconds, rewards drop and the model corrects next round.

🍞 Hook: Like a coach who scores your practice runs so you steadily improve. 🥬 Filling: The Concept — Group Relative Policy Optimization (GRPO)

  • What it is: A reinforcement learning method that improves the model by comparing a group of its own answers and boosting the better ones.
  • How it works: Generate several outputs, score them, measure each against the group average, then nudge the model toward the winners.
  • Why it matters: Without GRPO, the model might stick to okay-but-not-great habits from SFT. 🍞 Anchor: The model learns that scripts with precise times and full six-part details beat vague, off-timed scripts.
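The "measure each against the group average" step can be sketched as a standardized advantage. This is a simplified view of GRPO's group baseline, not the full algorithm:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled output relative to its group:
    positive means better than the group average, negative means worse."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Hypothetical combined rewards for four candidate scripts of the same video:
adv = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
# The 0.8 candidate gets a positive advantage, the 0.2 candidate a negative one.
```

Training then nudges the model toward outputs with positive advantage, so "precise times + full six-part details" wins over "vague and off-timed".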

🍞 Hook: Think of house rules in a board game—follow the format, keep turns reasonable, score points for being accurate, and bonus for capturing story flavor. 🥬 Filling: The Concept — Reward Design (Format, Length, Timestamp, SodaM)

  • What it is: A set of scores guiding the model’s behavior.
  • How it works: Check if the output parses (format), isn’t too short or too long (length), aligns with ground-truth times (timestamp), and covers key details in the right scenes (SodaM).
  • Why it matters: Without these, the model could ramble or skip vital details. 🍞 Anchor: A candidate that nails times but forgets dialogue gets docked; one that includes dialogue at the right moment earns more points.
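One natural way to combine the four signals is a weighted sum. The weights and the length band below are illustrative guesses, not the paper's values:

```python
def combined_reward(fmt_ok: bool, n_words: int,
                    time_iou: float, sodam: float) -> float:
    """Weighted sum of the four reward signals (weights are illustrative)."""
    format_r = 1.0 if fmt_ok else 0.0
    length_r = 1.0 if 50 <= n_words <= 400 else 0.0  # hypothetical word-count band
    return 0.2 * format_r + 0.1 * length_r + 0.35 * time_iou + 0.35 * sodam

# Parseable, reasonable length, good timing, decent detail coverage:
print(combined_reward(True, 200, 0.8, 0.6))  # 0.79
```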

🍞 Hook: Matching socks from two mixed-up baskets is tricky—you need a smart system. 🥬 Filling: The Concept — Dynamic Programming Alignment (inside SodaM)

  • What it is: A way to optimally match predicted scenes to reference scenes using time overlap scores.
  • How it works: Build a grid of overlaps (IoU) and find the best path pairing scenes.
  • Why it matters: Without careful matching, extra or fewer scene splits would be unfairly punished. 🍞 Anchor: If the model makes two short scenes where the reference has one longer scene, the method can merge and score them fairly.
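The "grid of overlaps" idea can be sketched as a classic dynamic program over an IoU matrix. This is a simplified monotonic matching, not SodaM's exact procedure:

```python
def iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def align(pred, ref):
    """Best time-ordered pairing of predicted and reference scenes:
    dp[i][j] = max total IoU using the first i predictions and j references."""
    n, m = len(pred), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j],          # skip prediction i
                           dp[i][j - 1],          # skip reference j
                           dp[i - 1][j - 1] + iou(pred[i - 1], ref[j - 1]))
    return dp[n][m]

pred = [(0, 10), (10, 20)]
ref = [(0, 9), (11, 20)]
print(align(pred, ref))  # 1.8 (each pair overlaps with IoU 0.9)
```

Because the DP can skip scenes on either side, a model that splits one reference scene into two is matched as fairly as possible rather than punished twice.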

Data pipeline:

  • Training data (TimeChatCap-42K) is synthesized with a two-step prompt: rough segments first, then rich six-part captions.
  • Low-quality samples are filtered: missing audio, single-scene clips, bad formatting, too-short segments.
  • Evaluation uses OmniDCBench (human-annotated) to avoid training-test overlap.

Secret sauce:

  • Six-part structure ensures full coverage.
  • Interleaving + temporal position signals tie “what” to “when.”
  • GRPO with task-specific rewards steers the model to be both precise and complete.

04 Experiments & Results

The Test: Does the model produce the right scenes at the right times, and describe each scene fully across all six parts (Events, Background, Camera, Shot Editing, Dialogue, Acoustic)? We measure timing, detail coverage, and a combined score (SodaM).

The Competition: Closed-source models (e.g., Gemini-2.5-Pro/Flash), open-source general video-language models (Qwen2.5-Omni, MiniCPM-o-2.6, video-SALMONN-2), and expert time-aware models (LongVALE, TimeSuite, TimeExpert).

🍞 Hook: Imagine measuring how well puzzle pieces fit by both shape (time) and picture (content). 🥬 Filling: The Concept — Intersection over Union (IoU)

  • What it is: A number showing how much two time intervals overlap.
  • How it works: Overlap length divided by total covered length.
  • Why it matters: Without overlap scoring, we can’t tell if predicted times match ground-truth times. 🍞 Anchor: If the model says 00:34–00:41 and the truth is 00:32–00:42, there’s strong overlap, so IoU is high.
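Temporal IoU for the anchor example works out like this (a standard computation; the interval values are taken from the example above):

```python
def temporal_iou(pred, truth):
    """Overlap of two (start, end) time intervals divided by their combined span."""
    start_p, end_p = pred
    start_t, end_t = truth
    intersection = max(0.0, min(end_p, end_t) - max(start_p, start_t))
    union = max(end_p, end_t) - min(start_p, start_t)
    return intersection / union if union > 0 else 0.0

# Model predicts 00:34-00:41 (34s-41s); ground truth is 00:32-00:42 (32s-42s):
print(temporal_iou((34, 41), (32, 42)))  # 0.7 (7 seconds shared out of 10 covered)
```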

🍞 Hook: Teachers like checklists to see if you included all the parts of an essay. 🥬 Filling: The Concept — CheckList Score

  • What it is: A way to see if key details (per dimension) appear in the model’s caption.
  • How it works: Break ground-truth into atomic points (e.g., “overhead shot,” “impatient tone”), then count how many the model mentions.
  • Why it matters: Without a checklist, “close enough” wording might miss vital elements. 🍞 Anchor: If the truth lists “engine hum,” “cheerful music,” and “impatient tone,” and the model includes all three, the score is high.
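In spirit, the CheckList score is the fraction of atomic points the caption recovers. A stand-in sketch using simple substring matching (real scoring likely relies on an LLM judge or fuzzier matching):

```python
def checklist_score(caption: str, key_points: list[str]) -> float:
    """Fraction of ground-truth atomic points that appear in the caption."""
    caption = caption.lower()
    hits = sum(1 for point in key_points if point.lower() in caption)
    return hits / len(key_points) if key_points else 0.0

caption = ("Overhead shot of a white car; engine hum under cheerful music; "
           "driver speaks in an impatient tone.")
points = ["overhead shot", "engine hum", "impatient tone"]
print(checklist_score(caption, points))  # 1.0
```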

🍞 Hook: Report cards often say how many questions you got right out of all the important ones. 🥬 Filling: The Concept — F1 Score

  • What it is: A balance of precision and recall (getting things right and not missing things).
  • How it works: Combines how exact and how complete the predictions are.
  • Why it matters: Without balance, you could be picky (precise) but miss lots (low recall), or be generous (high recall) but include wrong stuff (low precision). 🍞 Anchor: For timestamps, F1 across IoU thresholds shows how well the model finds scene boundaries overall.
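Boundary F1 averaged over IoU thresholds can be sketched as follows. This simplified matcher does not enforce one-to-one pairing, so it is only an approximation of the benchmark's protocol:

```python
def boundary_f1(pred, ref, thresholds=(0.3, 0.5, 0.7)):
    """Mean F1 over IoU thresholds: a predicted scene counts as correct
    when it overlaps some reference scene above the threshold."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    scores = []
    for t in thresholds:
        matched = sum(1 for p in pred if any(iou(p, r) >= t for r in ref))
        precision = matched / len(pred) if pred else 0.0
        recall = matched / len(ref) if ref else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Two predicted scenes, each overlapping its reference with IoU 0.9:
print(boundary_f1([(0, 10), (10, 20)], [(0, 9), (11, 20)]))  # 1.0
```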

Scoreboard (context-rich):

  • On OmniDCBench, TimeChat-Captioner-7B-GRPO hits a SodaM score of 35.0—like earning an A when many others are getting C’s or B-’s—and even surpasses Gemini-2.5-Pro on the dense, time-aware caption quality metric.
  • Scene boundary F1 is 61.2 with strong mean IoU, second only to a top proprietary model for segmentation and ahead of other open models.
  • On VideoQA (Daily-Omni, World-Sense), captions from TimeChat-Captioner help answer questions better than other open baselines (52.8 and 22.6). Think of it as giving the quiz master a better study guide.
  • On temporal grounding (Charades-STA), after fine-tuning, it beats expert models and the Qwen2.5-Omni baseline across R@0.3/0.5/0.7 and mIoU—showing that learning to write time-aware scripts teaches excellent time sense.

Surprising Findings:

  • A small GRPO set (about 2K samples) with smart rewards improves more than simply doubling SFT data—quality beats quantity when the rewards focus on what matters.
  • The six-part structure reduces hallucinations (e.g., misidentifying a driver’s gender) by forcing attention to grounded details in each category.
  • Merging many-to-one scene matches during evaluation prevents unfair penalties and reflects how humans sometimes split or merge scenes differently.

05 Discussion & Limitations

Limitations:

  • Long videos (like hour-long content) still challenge the model; the 32K token window can’t fit all frames plus long script outputs.
  • Generalization to very different durations isn’t perfect; the model may prefer familiar segment lengths.
  • Synthetic training data, even when filtered, can carry subtle biases from the generator; human annotation at scale is costly.

Required Resources:

  • A multi-GPU setup for training (the paper used 32×80G GPUs), long-context handling, and audio-visual preprocessing.
  • Time-synced audio-video inputs (2 FPS sampling works in practice) and careful token budgeting.

When NOT to Use:

  • Ultra-long, live streams where near-real-time, hour-spanning coverage is needed without chunking.
  • Domains with strict factual accuracy demands but no tolerance for minor timing slips (e.g., legal evidence analysis without human verification).
  • Low-audio-quality videos where acoustic cues are unreliable or missing.

Open Questions:

  • How to maintain time-aware detail for hour-long or multi-hour videos without losing context—can token compression or hierarchical memory fully solve it?
  • How to avoid bias and hallucination when the six-part structure nudges the model to “fill every box” even if some parts are sparse in a scene?
  • Can we unify captioning with video generation, so structured scripts directly guide multi-scene video synthesis?
  • What are the best reward balances across format, length, timestamp, and detail for different genres (movies vs. sports vs. lectures)?

06 Conclusion & Future Work

Three-sentence summary: This paper defines Omni Dense Captioning, a new way to script videos with continuous timestamps and rich six-part audio-visual descriptions. It builds OmniDCBench and the SodaM metric to evaluate both timing and detail, and introduces TimeChat-Captioner, a two-stage trained model that excels at this task. The resulting captions also strengthen video QA and temporal grounding.

Main achievement: Showing that a structured, time-aware, six-part script approach—plus reward-guided training—beats strong baselines (even proprietary ones) on dense, time-aware caption quality.

Future directions: Extend context windows and apply token compression to handle long videos; collect more diverse long-form data; explore tighter coupling with video generation so captions can become controllable scripts. Improve robustness to varied audio quality and reduce reliance on synthetic sources.

Why remember this: It turns video understanding from loose summaries into screenplay-like scripts with exact timing, camera moves, dialogue, and sound—bridging how humans watch movies and how AIs should explain them. That shift in structure and evaluation opens doors for better accessibility, smarter search, stronger reasoning, and more creative tools built on top of video.

Practical Applications

  • Create accessible, time-stamped descriptions for people who are blind or deaf, covering both visuals and sounds.
  • Enable precise video search like “show me when the driver admits borrowing the car,” jumping straight to the right second.
  • Auto-generate study guides for lectures by segmenting topics and capturing key visuals, quotes, and audio cues.
  • Help sports analysts find and describe exact plays, camera angles, and crowd reactions with timestamps.
  • Assist video editors by producing a script that lists scenes, camera moves, and dialogue to plan cuts quickly.
  • Improve customer support by summarizing tutorial videos into step-by-step, time-anchored instructions.
  • Boost news verification by aligning claims with exact video moments and noting audio tone shifts.
  • Power video QA systems with richer captions so they answer questions more accurately.
  • Support content moderation by flagging time-anchored segments with sensitive actions or speech.
  • Guide video-to-audio generation or dubbing by providing scene-by-scene cues of ambiance and tone.
#Omni Dense Captioning #time-aware video captioning #audio-visual understanding #scene segmentation #SodaM metric #dynamic programming alignment #CheckList score #IoU #TimeChat-Captioner #GRPO #SFT #interleaved audio-visual tokens #M-RoPE #OmniDCBench #temporal grounding