
RIVER: A Real-Time Interaction Benchmark for Video LLMs

Intermediate
Yansong Shi, Qingsong Zhao, Tianxiang Jiang et al. · 3/4/2026
arXiv

Key Summary

  • RIVER Bench is a new test that checks how well AI can watch a video stream and talk with you in real time.
  • It measures three skills: remembering the past (Retro-Memory), describing the present (Live-Perception), and responding the moment an anticipated event happens (Pro-Response).
  • Unlike older tests that give the whole video at once, RIVER times questions and answers exactly like a live conversation.
  • The authors also show a simple way to turn offline video AIs into online ones using a sliding window plus long–short-term memory.
  • They design fair scores that balance speed and accuracy, punishing early false alarms and giving decaying credit to late answers.
  • Across many models, offline systems did fine on single questions but struggled with strict real-time tasks.
  • Adding memory and training with RIVER-style data made models react faster and remember longer, with big gains on proactive responses.
  • The benchmark covers time gaps from seconds to an hour, so it can measure how much models forget over time.
  • This work pushes AI toward helpful, timely assistants for everyday video tasks like navigation, safety alerts, and step-by-step guidance.

Why This Research Matters

Real life happens live: cooking, driving, caregiving, or learning from demonstrations all need assistants that understand now, remember then, and speak at the right moment. RIVER provides the first clear, fair way to measure and train those skills together in video. With its timing-aware tasks and memory-friendly setup, it pushes models beyond offline summaries toward trustworthy, on-the-spot help. That means safer alerts, smoother navigation, and better step-by-step guidance. It also opens paths for AR headsets and home robots to become more reliable partners. By standardizing how we test these abilities, RIVER accelerates progress and helps everyone build assistants you can depend on.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re FaceTiming a friend while fixing a bike. You ask, “Did I already tighten that bolt?” (past), “What am I holding now?” (present), and “Tell me the moment the wheel starts spinning.” (future). You want answers right when you need them, not five minutes later.

🥬 Filling (The Actual Concept):

  • What it is: Before this paper, most video AIs worked offline—they watched an entire video first, then answered questions later, which is great for summaries but not for live help.
  • How it works (history of the field):
    1. Early video benchmarks tested if AI could understand a full, finished video, often with multiple-choice questions.
    2. Some systems stretched to long videos by sampling fewer frames or compressing, but they still waited until the end to answer.
    3. A few newer systems tried streaming, but there wasn’t a clear, fair way to measure memory over time, instant understanding, and precisely timed responses.
  • Why it matters: Real life is live. If you’re cooking, biking, or wearing AR glasses, you need an assistant that remembers what just happened, understands what’s happening now, and alerts you exactly when something you asked for appears.

🍞 Bottom Bread (Anchor): Think of a soccer coach on the sideline. They recall who passed earlier (past), shout “Shoot!” right now (present), and tell the goalie to catch the ball as soon as it flies in (future). A good benchmark must test all three at live speed.

—

🍞 Top Bread (Hook): You know how a teacher doesn’t grade everyone only after the school year ends? They check progress during class too.

🥬 Filling (The Problem):

  • What it is: Existing tests didn’t capture live dialogue with streaming video—timing matters as much as correctness.
  • How it works (what was missing):
    1. No standard way to test if models forget over time (no “forgetting curve” built in).
    2. No joint measure for when you should answer versus how correct it is (speed–accuracy balance).
    3. Future alerts weren’t scored in a way that discourages early false alarms while still rewarding timely hits.
  • Why it matters: Without this, we can’t compare models fairly or know what to fix to make them truly helpful in real life.

🍞 Bottom Bread (Anchor): If an AI shouts “Car coming!” ten seconds early, you’ll stop trusting it; if it shouts five seconds late, it’s not very helpful. We need a scoreboard that understands both timing and truth.

—

🍞 Top Bread (Hook): Imagine sorting a giant movie into three folders labeled “Before,” “Now,” and “Soon.”

🥬 Filling (Previous Attempts and Gap):

  • What it is: Past benchmarks sometimes tagged timestamps, but they didn’t formalize three distinct, interactive task types with precise rules for when to ask and when to answer.
  • How it works (why earlier tries fell short):
    1. Many allowed models to peek at the whole video—unrealistic for live apps.
    2. Some measured fluency or speed but not answer accuracy together with timing.
    3. Few studied how memory fades as the question drifts farther from the event time.
  • Why it matters: Without this structure, we can’t train or judge assistants that must stay alert, remember well, and react exactly on cue.

🍞 Bottom Bread (Anchor): When you say, “Tell me the moment the toast pops,” you want an answer right then—not earlier guesses or late stories. The benchmark must capture that.

—

🍞 Top Bread (Hook): Think of a librarian who helps while you walk through the shelves. You ask about a book you saw earlier, what’s in front of you, and to tap your shoulder when a specific author appears.

🥬 Filling (RIVER’s Why):

  • What it is: RIVER Bench is a live-video interaction test that scores memory of the past, understanding of the present, and precisely timed responses about the near future.
  • How it works:
    1. It uses videos from many sources and lengths, inserts questions at exact times, and labels the event times that matter.
    2. It splits tasks into Retro-Memory, Live-Perception, and Pro-Response (instant or streaming narration).
    3. It measures accuracy and timing together, and analyzes how performance changes as time gaps grow.
  • Why it matters: This turns real-time video help from a demo into something we can measure, improve, and trust.

🍞 Bottom Bread (Anchor): It’s like a driving test examiner with a stopwatch: Did you spot the sign? Did you react at the right second? Can you remember what lane you used earlier? RIVER checks all of this in a fair, repeatable way.

02Core Idea

🍞 Top Bread (Hook): You know how a good babysitter remembers what your little brother just did, watches what he’s doing now, and jumps in exactly when he’s about to spill juice?

🥬 Filling (The Aha! Moment):

  • One-sentence key insight: If we want real-time video AIs to be truly helpful, we must test and train them with a benchmark that treats time as a first-class citizen—past, present, and future—together with precise rules for when to watch, when to wait, and when to speak.

Multiple Analogies:

  1. Traffic Light Analogy
  • Imagine a smart car assistant: remember where you turned (past), read the current light (present), and beep exactly when it turns green (future). RIVER checks all three.
  2. Sports Replay Analogy
  • In basketball: recall who assisted (past), call the foul now (present), and shout when a screen is about to open (future). RIVER times the calls.
  3. Cooking Show Analogy
  • Recall if salt was already added (past), describe the simmering now (present), and announce when the oven beeps (future). RIVER makes sure both the timing and the correctness are right.

Before vs After:

  • Before: Models watched whole videos first, then answered. Memory wasn’t tested over long gaps. Timing wasn’t scored precisely; false early alarms weren’t discouraged well.
  • After: RIVER enforces live rules—limited windows, timestamped questions, and response windows with fair scoring—so models learn to remember longer, understand instantly, and respond on time.

Why It Works (Intuition, not equations):

  • When you limit what the model can see to a recent “window,” you force it to keep a useful diary (long-term memory) and use sharp attention for the current scene (short-term memory).
  • When scores reward on-time answers and penalize early guesses, the model learns to wait until the right cue appears.
  • When you spread questions across short to very long gaps, you expose and can fix forgetting.

Building Blocks (each with the Sandwich pattern):

  1. 🍞 Hook: Imagine peeking through a moving train window—you only see a bit at a time. 🥬 The Concept: Sliding Window Sampling shows the model just the latest chunk of video instead of the whole thing.

    • How it works: (a) Sample frames at a steady rate; (b) Feed the current chunk into the model; (c) Slide forward and repeat.
    • Why it matters: Without it, live systems would lag or run out of memory. 🍞 Anchor: Like watching a parade from one spot on the sidewalk—you watch each float as it passes.
  2. 🍞 Hook: You keep a small notepad for key details you might need later. 🥬 The Concept: Long-Term Memory Module stores compressed summaries of older video parts.

    • How it works: (a) Save important features from earlier windows; (b) Compress or merge similar memories; (c) Retrieve them when a question references the past.
    • Why it matters: Without it, the model forgets where you put the toolbox minutes ago. 🍞 Anchor: It’s like your photo album—small pictures, big memories.
  3. 🍞 Hook: You glance quickly at what’s in your hand right now. 🥬 The Concept: Short-Term Memory keeps the current window’s details fresh and precise.

    • How it works: (a) Hold the most recent frames; (b) Use them for immediate questions; (c) Replace as the stream advances.
    • Why it matters: Without it, the model might describe yesterday when you ask about now. 🍞 Anchor: Like a sticky note on your desk that you update every minute.
  4. 🍞 Hook: A referee needs both a rulebook (correctness) and a whistle timing (when to call). 🥬 The Concept: Latency–Accuracy Trade-off balances how fast and how right the answer is.

    • How it works: (a) Reward answers within a tolerance window; (b) Penalize early false alarms strongly; (c) Give decaying credit to late answers.
    • Why it matters: Without it, models either blurt wrong answers or wait too long to be useful. 🍞 Anchor: Like catching a bus—you must be at the stop on time; too early or late won’t work.
  5. 🍞 Hook: A treasure map marks the exact X where the gold is. 🥬 The Concept: Temporal Localization finds the exact moment an event happens.

    • How it works: (a) Track cues across frames; (b) Match patterns (e.g., “wrench appears”); (c) Trigger response at the right second.
    • Why it matters: Without it, the model answers at random times. 🍞 Anchor: Like pausing a replay exactly when the goal is scored.
  6. 🍞 Hook: A tour guide talks as you walk, not after the trip. 🥬 The Concept: Proactive Response lets the model narrate continuously or alert exactly when a requested event occurs.

    • How it works: (a) Monitor the stream; (b) Detect the cue; (c) Speak right away.
    • Why it matters: Without it, you get help only after the moment passes. 🍞 Anchor: “Tell me when the toast pops”—ding! The model speaks right then.
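The sliding-window and long/short-term-memory building blocks above can be sketched in a few lines of Python. Everything here is illustrative: the class name, window and slot sizes, the use of a single float per frame as a stand-in "feature," and mean-pooling as the compression step are all assumptions for the sketch, not the paper's actual implementation.

```python
from collections import deque


class StreamingMemory:
    """Toy sketch of sliding-window sampling plus long/short-term memory.

    Frames are reduced to single float "features" so the sketch stays
    dependency-free; all names and sizes are illustrative assumptions.
    """

    def __init__(self, window_size=16, long_slots=8):
        self.window_size = window_size             # short-term span in frames
        self.long_slots = long_slots               # fixed long-term budget
        self.short_term = deque(maxlen=window_size)  # oldest frame auto-drops
        self.long_term = []                        # compressed summaries
        self._frames_seen = 0

    def add_frame(self, feature):
        """Slide the window forward; archive a summary every full window."""
        self.short_term.append(feature)
        self._frames_seen += 1
        if self._frames_seen % self.window_size == 0:
            # Compress the window that just completed (mean-pool here).
            summary = sum(self.short_term) / len(self.short_term)
            self.long_term.append(summary)
            self._merge_if_over_budget()

    def _merge_if_over_budget(self):
        """Merge the two most similar adjacent slots to stay in budget."""
        while len(self.long_term) > self.long_slots:
            diffs = [abs(a - b)
                     for a, b in zip(self.long_term, self.long_term[1:])]
            i = diffs.index(min(diffs))            # most similar neighbours
            merged = (self.long_term[i] + self.long_term[i + 1]) / 2
            self.long_term[i:i + 2] = [merged]


mem = StreamingMemory(window_size=4, long_slots=3)
for t in range(40):                                # simulate a 40-frame stream
    mem.add_frame(float(t))
print(len(mem.short_term), len(mem.long_term))     # 4 3
```

However long the stream runs, the short-term window stays at 4 frames and the long-term store never exceeds its 3-slot budget, which is the fixed-cost property that makes live operation feasible.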

03Methodology

High-Level Recipe: Input (streaming video) → [Sliding Window sampling] → [Short-Term Memory] + [Long-Term Memory] → [LLM reasoning with timed prompts] → Output (on-time answers or continuous narration)

Step-by-Step with the Sandwich pattern and examples:

  1. 🍞 Hook: Imagine watching a parade and writing quick notes as each float passes. 🥬 The Step (Sliding Window Sampling):

    • What happens: The system samples a few frames per second and groups them into a current window.
    • Why it exists: Without windows, the model must process the entire video, causing delays and memory overflow.
    • Example with data: If we sample 1 frame per second from a 2-minute clip, the model sees 120 frames in sequence, one small window at a time, not all at once. 🍞 Anchor: Like reading a book through a small bookmark window—one paragraph at a time.
  2. 🍞 Hook: You keep today’s sticky note on your desk. 🥬 The Step (Short-Term Memory):

    • What happens: The latest window’s visual tokens represent now; they are precise and detailed.
    • Why it exists: Without it, live questions (“What color is the cup right now?”) might use old frames.
    • Example with data: A 16-frame window holds the most recent 16 frames; when a new frame arrives, the oldest drops out. 🍞 Anchor: Like a whiteboard you erase and rewrite every minute.
  3. 🍞 Hook: You also keep a small scrapbook for older but important pages. 🥬 The Step (Long-Term Memory with merging):

    • What happens: Older windows are compressed and stored in fixed slots; similar ones are merged to save space.
    • Why it exists: Without it, the system forgets events from minutes ago or runs out of memory.
    • Example with data: Keep 8 memory slots, each summarizing 16 frames. If a ninth comes, merge the two most similar into one. 🍞 Anchor: Like combining near-duplicate photos into one album page.
  4. 🍞 Hook: When you ask something, the helper needs both the recent scene and the diary of what happened before. 🥬 The Step (Timed Prompting to the LLM):

    • What happens: The system builds a prompt that labels the long-memory time span and the short-memory time span, then asks the question.
    • Why it exists: Without explicit time labeling, the model can confuse past and present.
    • Example with data: “This contains a long memory of 0.0 to 600.0 seconds. This contains a short clip sampled from 590.0 to 600.0 seconds. Question: …” 🍞 Anchor: Like giving a librarian both the archive shelf number and today’s newspaper.
  5. 🍞 Hook: Scoring live answers is like timing a 100m sprint: both when and how you finish matter. 🥬 The Step (Pro-Response Scoring):

    • What happens: The benchmark defines a target time and a tolerance window; answers inside the window score full points; early answers get zero; late ones fade linearly.
    • Why it exists: Without this, models could spam early guesses or delay too long without penalty.
    • Example with data: If the true event is at 30.0s and the tolerance window is 4.0s (from 28.0–32.0s), answering at 31.0s gets full credit, at 27.5s gets 0, at 33.0s gets partial credit. 🍞 Anchor: Like catching a bus: arrive within the window, you ride; too early (bus not here) or too late (bus gone) is a miss.
  6. 🍞 Hook: You don’t always talk—sometimes you wait for the right moment. 🥬 The Step (When to Speak):

    • What happens: The model continuously monitors incoming frames; if a user asked, “Tell me when the cat returns,” it stays silent until the cat reappears, then responds instantly.
    • Why it exists: Without this patience, the model would be noisy and unhelpful.
    • Example with data: In an egocentric kitchen video, the model waits 12 seconds after the request and says “The kettle is boiling now” exactly when steam rises. 🍞 Anchor: Like a museum guide who points out a painting only when you reach it.
  7. 🍞 Hook: If you store too much detail, your backpack gets heavy; if you store too little, you forget. 🥬 The Step (Compression and Merging in Memory):

    • What happens: The model downsamples or averages similar memory slots to keep a fixed budget.
    • Why it exists: Without it, memory explodes or becomes too fuzzy to be useful.
    • Example with data: Average two similar slots (e.g., both “stirring in a bowl”) into one summary slot to free space. 🍞 Anchor: Like merging nearly identical class notes into one clean page.
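The Pro-Response scoring rule from step 5 can be written as a small function: full credit inside the tolerance window, zero for early false alarms, and linearly decaying credit for late answers. The 4-second decay span below is an assumed constant chosen so the numbers match the worked example; the benchmark defines its own.

```python
def pro_response_score(t_answer, t_event, tolerance=4.0, decay=4.0):
    """Timing score sketch for a proactive response.

    Full credit inside [t_event - tolerance/2, t_event + tolerance/2],
    zero for early answers (false alarms), and credit that fades
    linearly over `decay` extra seconds for late answers. The decay
    span is an illustrative assumption, not the benchmark's constant.
    """
    lo = t_event - tolerance / 2
    hi = t_event + tolerance / 2
    if t_answer < lo:
        return 0.0                       # early false alarm: no credit
    if t_answer <= hi:
        return 1.0                       # on time: full credit
    late = t_answer - hi
    return max(0.0, 1.0 - late / decay)  # late: fades linearly to zero


# Event at 30.0 s with a 4.0 s tolerance window (28.0-32.0 s):
print(pro_response_score(31.0, 30.0))    # on time -> 1.0
print(pro_response_score(27.5, 30.0))    # too early -> 0.0
print(pro_response_score(33.0, 30.0))    # 1.0 s late -> 0.75
```

Making the early penalty a hard zero (rather than a symmetric decay) is what discourages models from spamming guesses before the cue appears.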

The Secret Sauce:

  • Precisely timed tasks (past, present, future) plus a memory strategy (short + long) and fair timing-aware scoring push models to learn real live behavior: remember longer, see sharper now, and speak at the right moment.

Key Formulas (with kid-friendly examples):

  • Learning objective for a live turn: L = −log P_θ(r_t | V_{t′:t}, q, h_{<t′}, r_{<t}). Example: if the model assigns the correct reply a probability of 0.20, then L = −log(0.20) ≈ 1.61.
  • Time gap that tests memory or anticipation: Δ = |t_V − t|. Example: if an event happened at t_V = 90 seconds and you ask at t = 120 seconds, then Δ = 30.
  • Streaming frame token shape used by the visual encoder: (1 + 3×3) × d = 10 × d. Example: if d = 1024, that is 10 × 1024 tokens per frame.
  • Timing score for proactive response (simplified intuition): if t_g is the true event time and w is the tolerance window, answers inside [t_g − w/2, t_g + w/2] score 1, early answers score 0, and late answers decay linearly. Example: with t_g = 20 s and w = 4 s, answering at 21 s scores 1, at 18 s scores 1, at 16 s scores 0, and at 23 s gets partial credit (e.g., ≈ 0.5 if the decay is set that way).
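The worked numbers in the formulas above can be verified directly; a minimal Python check:

```python
import math

# Learning objective for one live turn: L = -log P(correct reply)
p_correct = 0.20
loss = -math.log(p_correct)
print(round(loss, 2))            # 1.61

# Time gap that drives the forgetting analysis: delta = |t_V - t|
t_event, t_ask = 90.0, 120.0
delta = abs(t_event - t_ask)
print(delta)                     # 30.0

# Visual-token shape per streaming frame: (1 + 3*3) * d
d = 1024
print((1 + 3 * 3) * d)           # 10240
```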

04Experiments & Results

🍞 Top Bread (Hook): Picture a science fair where each robot helper must help you during a live video, not after the show ends.

🥬 Filling (The Test):

  • What they measured and why:
    1. Retro-Memory: Can models answer questions about the past as time gaps grow from seconds to an hour?
    2. Live-Perception: Can they answer correctly about what’s happening right now, with low delay?
    3. Pro-Response: Can they wait and respond exactly when a requested event happens, avoiding early false alarms?
  • How they did it: They used videos from Vript-RR, LVBench, LongVideoBench, Ego4D, and QVHighlights; each question and event got exact timestamps.

🍞 Bottom Bread (Anchor): Like testing runners for sprints (now), marathons (long memory), and baton-handoff timing (proactive alerts).

—

The Competition:

  • Closed-source leaders: GPT-4o, Gemini 1.5-Pro.
  • Open models (offline by default): VideoChat2, InternVL2.5, LLaVA-Video, VideoChat-Flash.
  • Native online or adapted: Flash-VStream, VideoLLM-Online, and offline models upgraded with the authors’ sliding-window + memory pipeline.

Scoreboard with Context:

  • Retro-Memory (remembering the past): GPT-4o performed best among compared systems; most models’ accuracy dropped as the time gap increased. Models with added long–short-term memory decayed more slowly—like getting a steadier B instead of slipping from B to D over time.
  • Live-Perception (understanding now): GPT-4o led here too. Notably, some offline models, once adapted with the online pipeline, became highly competitive—even beating some native streaming baselines on present-time questions. That’s like teaching a good test-taker to also answer fast in class.
  • Pro-Response (timed future alerts): Baseline streaming models struggled under RIVER’s strict timing. After fine-tuning VideoLLM-Online with RIVER-style training data, proactive response accuracy jumped by about 11 percentage points—like moving from a C to a strong B on the hardest timing test.

Surprising Findings:

  • Memory mechanisms matter more than expected: a memory-augmented model kept its accuracy steadier across long gaps, countering the typical fast-forgetting trend.
  • Not all online claims generalize: systems tuned for long-video comprehension didn’t automatically shine at live, interactive Q&A under strict timing rules.
  • Fine-grained vs causal cues: All models were weaker at causal reasoning (why/what-next), showing a clear area for growth.

Numbers made meaningful:

  • Think of a class where most students score around B- on single, easy questions. When asked to work live, remember old details, and shout an alert at the perfect second, many drop to C or D. With RIVER-style memory and training, they climb back toward B and even B+ on some tasks.

Key Formula Reminder (and why it helps interpret results):

  • Time gap Δ = |t_V − t| guides the forgetting analysis. Example: a question asked 600 seconds after the event (Δ = 600) is much harder than one asked 30 seconds after it (Δ = 30), so a slower accuracy decline means the memory module is working.

🍞 Bottom Bread (Anchor): It’s like teaching lifeguards: not only must they know the pool rules (accuracy) and watch swimmers right now (perception), but they must also blow the whistle at the exact right second (timing). RIVER shows which lifeguards are truly ready.

05Discussion & Limitations

🍞 Top Bread (Hook): Imagine grading a band that plays live. It’s harder than judging a studio album because timing, coordination, and quick reactions all count.

🥬 Filling (Honest Assessment):

  • Limitations:
    1. No audio yet. Many real-time cues are sounds (kettle whistle, doorbell), so adding audio will make the test closer to real life.
    2. Benchmarks can’t cover every scenario; more messy, real-world clips (lighting changes, occlusions) would add toughness.
    3. Some scoring still relies on an LLM grader for open-ended answers, which can introduce small inconsistencies despite strong controls.
    4. Causal reasoning remains challenging; future datasets may need richer chains of events and goals.
  • Required Resources:
    1. A GPU-capable setup to run vision encoders and LLMs fast enough for live windows.
    2. Memory budgets for both short-term and long-term slots.
    3. The RIVER annotations and evaluation code (index-only release; users fetch videos from original sources).
  • When NOT to Use:
    1. Purely offline summarization tasks where exact timing doesn’t matter.
    2. Static-image benchmarks; RIVER is meant for live video streams.
    3. Ultra-low-power edge devices without enough speed to maintain sliding windows.
  • Open Questions:
    1. How to combine audio with video for even better, earlier cues?
    2. What memory compression best preserves causal details without bloating tokens?
    3. Can we create training signals that directly teach “don’t speak yet” vs “speak now” in a more principled way across domains?
    4. How can we evaluate nuanced, multi-step proactive plans (not just single alerts) live?

🍞 Bottom Bread (Anchor): Like moving from a solo recital to a full orchestra with audience requests—RIVER takes the first big step, but adding sound, richer pieces, and better conductors will make the show shine.

06Conclusion & Future Work

Three-Sentence Summary:

  • This paper introduces RIVER Bench, a real-time interaction benchmark that fairly tests video AIs on remembering the past, understanding the present, and responding exactly when future events occur.
  • It also provides a simple online pipeline—sliding windows plus long–short-term memory—and a specialized training set that together boost live performance, especially for proactive responses.
  • Experiments across many models reveal clear gaps and show that timing-aware evaluation and memory modules are key to trustworthy, on-the-spot assistance.

Main Achievement:

  • Turning time into a first-class citizen for video LLMs—by precisely defining and measuring retro-memory, live-perception, and pro-response, and showing practical ways to make offline models act online.

Future Directions:

  • Integrate audio for richer, earlier cues; design stronger causal-reasoning tasks; build standardized, robust LLM graders; and explore lighter, smarter memory compression for edge devices.

Why Remember This:

  • RIVER shifts video AI from “watch first, answer later” to “watch, remember, and act right on time,” which is exactly what real-world helpers must do—whether guiding a recipe, navigating AR, or keeping people safer through timely alerts.

Practical Applications

  • AR navigation assistant that remembers recent turns and announces the next turn exactly on time.
  • Kitchen helper that confirms past steps (salt added), describes current actions, and alerts when the timer or visual cue triggers.
  • Workshop safety aide that warns right when a hand nears a dangerous zone, not before or after.
  • Classroom demo coach that narrates lab steps live and reminds students of earlier steps on request.
  • Sports training tool that flags the exact moment of form errors and recalls prior attempts for comparison.
  • Elder care monitor that notes routine activities, describes current situations, and prompts at medication times on cue.
  • Robotics supervisor that tracks task stages, perceives current states, and triggers handoffs precisely when conditions are met.
  • Video QA customer support that answers about earlier moments in a tutorial and guides users in real time.
  • Security analyst assistant that bookmarks past events and pings exactly when a watched-for object reappears.
  • Streaming content narrator that provides concise, timely descriptions for accessibility without lag or early noise.
Tags: RIVER Bench, online video understanding, multimodal large language models, retro-memory, live perception, proactive response, temporal localization, sliding window, long-term memory, latency–accuracy trade-off, forgetting curve, dense video captioning, real-time evaluation, streaming benchmarks, video LLMs