NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
Key Summary
- NarraScore turns a video's changing story into a matching soundtrack by using emotion as the bridge.
- It teaches a vision-language model to read a movie like a book and output a smooth curve of feelings called Valence (happy–sad) and Arousal (calm–excited).
- The system splits control into two layers: a Global Semantic Anchor for overall style (like 'mysterious, strings, slow') and a Token-Level Affective Adapter for moment-to-moment tension.
- A Temporal Super-resolution Adapter stretches and smooths the emotion curve so music changes feel natural, not jumpy.
- Instead of rebuilding a big model, NarraScore adds tiny, smart modules that gently nudge a frozen music generator (MusicGen) at very low extra cost.
- On long videos, it beats strong baselines in both metrics (e.g., best FAD and KLD) and human ratings for emotional consistency, style match, and long-term coherence.
- It avoids expensive frame-by-frame attention by using overlapping windows plus a single global style, keeping memory and time reasonable.
- The method shows frozen VLMs already carry enough narrative understanding to act as 'affective sensors' with only a small probe on top.
- Main limits are fine-grained timing for very fast events and possible error carry-over from the emotion reader; future work aims for end-to-end training.
- This paves the way for fully automatic, film-like scoring for vlogs, movies, games, and educational videos.
Why This Research Matters
Matching music to a story's feelings keeps viewers engaged, helps learning, and makes content feel professional. NarraScore lets solo creators and small teams get film-like scoring without hiring a composer or using expensive tools. Because it's efficient and mostly frozen, it's easier to run and scale to the long videos found in vlogs, documentaries, and lectures. Games and interactive media benefit too, since music can follow the player's unfolding actions with emotional logic, not just sound effects. It also encourages accessible creativity: students can experiment with emotional curves to understand storytelling. Finally, it points to a broader idea: using emotion as a compact control signal to coordinate audio, visuals, and even lighting or haptics in future multimedia.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a great movie soundtrack makes your heart race in a chase scene and then calms you down during a quiet sunset? The music follows the story's feelings, not just the pictures.
The Concept (Affective control): It's the idea of steering music using the audience's feelings over time so soundtracks rise and fall with the plot. How it works:
- Read the video story and estimate how feelings change second by second.
- Use those feelings to guide when music should speed up, quiet down, brighten, or darken.
- Keep the big-picture style steady while allowing small, timely changes.
Why it matters: Without affective control, long videos get one-note, mismatched music that ignores tension, surprise, or relief.
Anchor: In a spooky hallway scene that gets scarier, affective control turns up tension gradually instead of staying flat.
The World Before: AI could already make short musical clips for short videos by looking at motion or objects in frames. But that often meant matching surface visuals ("there's a dog → play something playful") and ignoring deeper story flow (rising danger, fake-out, twist). Models used dense frame-by-frame attention: accurate for a few seconds, but costly and confused over minutes. They also lacked a "global anchor," so style drifted: starting as calm piano, ending as random synths.
The Problem: Long videos need two things at onceāglobal coherence (a consistent style) and local alignment (music that reacts to each moment). Old methods either:
- Spent too much compute attending to every frame, or
- Lost the story thread and emotional arc, sounding generic or disjointed.
Failed Attempts:
- Rule-based mappings from motion to beats: too simple for nuanced stories.
- Frame-wise CLIP emotion classifiers: noisy and shallow; often miss hidden tension.
- Text captions driving music: good big-picture mood but weak at second-to-second changes.
- Segment-then-stitch methods: continuous sound but little true narrative responsiveness.
Hook: Imagine reading a whole book by staring at each letter separately; you'd miss the plot.
The Concept (Vision-Language Models, VLMs): VLMs understand pictures and text together, so they can reason about what's happening in a video beyond just objects. How it works:
- Show frames plus tiny time tags ("Time: 12s").
- Prompt the model to focus on story and feelings.
- Use a small probe to turn its hidden understanding into emotion numbers.
Why it matters: Without VLM-level reasoning, we misread scenes with subtle tension or implied danger.
Anchor: A VLM can tell the difference between "a person smiling in a tense courtroom" (nervous) vs. "a person smiling at a picnic" (relaxed).
The Gap: We needed a way to squeeze the story's deep logic into a compact control signal that music models can follow, continuously and cheaply, over minutes.
Hook: Think of a map that shows only the roads you need today, with no clutter.
The Concept (Valence–Arousal trajectories): A two-line curve over time, Valence (happy–sad) and Arousal (calm–excited), that compresses complex narrative emotion. How it works:
- Sample the video at 1 frame per second with time hints.
- Ask the VLM (with a guiding instruction) to think about feelings.
- Use a tiny neural head to output VA values each second, then smooth them.
Why it matters: Without VA curves, music changes feel random, not story-driven.
Anchor: When a chase starts, Arousal rises; when the hero reunites with family, Valence rises.
Real Stakes: This matters for YouTubers, filmmakers, teachers, and gamers. Viewers remember how something made them feel; if the soundtrack follows the emotional arc, people stay engaged, learn better, and enjoy more. Plus, making film-grade scoring affordable and automatic opens creative doors for everyone.
02 Core Idea
Hook: Imagine a movie's feelings were a simple rollercoaster line; if you gave that line to a band, they could play perfectly to the ups and downs.
The Concept (The Aha!): Emotion is a high-density summary of narrative logic, so we can guide music for long videos by extracting a smooth Valence–Arousal curve and combining it with a global style anchor. How it works:
- Use a frozen VLM as an 'affective sensor' to turn video frames into VA curves.
- Create a Global Semantic Anchor (style, instruments, pace).
- Inject the global anchor for steadiness and the local VA for moment-to-moment tension into a frozen music model via tiny adapters.
Why it matters: Without this two-layer control, music either drifts in style or ignores the scene's changing feelings.
Anchor: A documentary stays 'warm acoustic folk' the whole time (global), while swelling gently during emotional interviews (local).
Multiple Analogies:
- Weather vs. climate: climate (global anchor) sets the region's vibe; weather (VA) changes daily.
- Director and conductor: the director sets the film's aesthetic (global); the conductor shapes each musical phrase to match scenes (local).
- Smoothie recipe: Base flavor (global) is constant; add fruit shots (local) for bursts of taste.
Hook: You know how having a theme keeps a playlist feeling unified?
The Concept (Global Semantic Anchor): A structured text summary of genre, instruments, emotion, and tempo that stabilizes musical identity. How it works:
- Ask the VLM to describe the implied sound, not the visuals.
- Use a fixed schema (genre, texture, mood, pacing) for consistency.
- Feed it to the music model through cross-attention to steer style.
Why it matters: Without it, long tracks drift, like switching stations mid-song.
Anchor: "Ambient orchestral, strings and soft pads, pensive, slow–medium" as a single anchor, kept throughout a 5-minute short film.
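The fixed-schema idea can be sketched as a tiny formatter. The field names (genre, texture, mood, pacing) follow the text above, but the exact keys, separator, and example values are illustrative assumptions, not the paper's actual schema.

```python
# Sketch of a fixed-schema Global Semantic Anchor. The field names and
# the comma-separated output format are assumptions for illustration.
def format_anchor(genre, texture, mood, pacing):
    """Render a structured style card as a single anchor string."""
    fields = {"genre": genre, "texture": texture, "mood": mood, "pacing": pacing}
    return ", ".join(f"{k}: {v}" for k, v in fields.items())

anchor = format_anchor("ambient orchestral", "strings and soft pads",
                       "pensive", "slow-medium")
print(anchor)
```

Keeping the schema fixed means every video yields an anchor in the same vocabulary, which is what lets one string stay stable across a whole long track.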
Hook: Imagine tapping someone's shoulder at just the right moments to speed up or calm down.
The Concept (Token-Level Affective Adapter): A tiny module that nudges the music model's early layers with the current VA, token by token. How it works:
- Stretch and smooth VA to match dense music tokens.
- Project VA into the model's hidden space.
- Add it as a small bias only in shallow layers so deeper layers keep sound quality.
Why it matters: Without gentle, precise nudges, changes feel jumpy or the audio quality drops.
Anchor: As arousal rises from 0.2 to 0.6, drums subtly add energy without changing the whole style.
Before vs. After:
- Before: Heavy frame attention, style drift, generic background hum.
- After: Stable style plus fluid emotion-following changes; scalable to minutes.
Why It Works (Intuition):
- VA compresses narrative, so control is simple and robust.
- Frozen backbones keep high-fidelity sound; small adapters guide, not rewrite.
- Separate global vs. local paths avoid tug-of-war between identity and reactivity.
Building Blocks (in order):
- Affective control (steer by feelings).
- VLM affective sensor (reads story feelings).
- Valence–Arousal trajectories (compact curve).
- Global Semantic Anchor (overall style).
- Temporal Super-resolution Adapter (smooth local curve to token-level).
- Token-Level Affective Adapter (precise, early-layer nudging).
03 Methodology
At a high level: Video → VLM reasoning + instruction → (A) Global Semantic Anchor, (B) Per-second Valence–Arousal → Temporal Super-resolution (interpolate + dilated conv) → Dense affective control → MusicGen decoder (global via cross-attention, local via token-wise bias) → Audio tokens → Codec → Waveform.
Hook: Imagine turning a long movie into a timeline with chapter titles (global) and emoji moods per second (local).
The Concept (Semantically-Anchored Temporal Alignment): A way to feed frames with tiny time tags to a VLM so it reasons causally over the story. How it works:
- Sample video at 1 Hz to balance detail and compute.
- Interleave frames with text like "Time: 37s."
- Let the VLM build a story-aware context.
Why it matters: Without time anchors, the model treats frames as unrelated snapshots.
Anchor: A montage scene becomes a tidy sequence the model can follow: Time: 1s (arrival), 2s (glance), 3s (door creaks)...
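The interleaving step above can be sketched in a few lines. The "Time: Ns" tag format follows the text, while the `<frame>` placeholder and the function name are hypothetical; a real pipeline would splice in actual frame embeddings at those positions.

```python
# Sketch of 1 Hz time-tagged interleaving for the VLM input. The
# "<frame>" placeholder stands in for a real per-second frame embedding.
def build_time_tagged_prompt(num_seconds, frame_token="<frame>"):
    """Interleave per-second frame placeholders with 'Time: Ns' tags."""
    parts = []
    for t in range(1, num_seconds + 1):
        parts.append(f"Time: {t}s")   # explicit time anchor for causality
        parts.append(frame_token)     # the frame sampled at second t
    return "\n".join(parts)

prompt = build_time_tagged_prompt(3)
print(prompt)
```

The point of the tags is that the VLM sees an ordered, timestamped sequence rather than a bag of unrelated snapshots.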
Step 1: Instruction-Driven Semantic Steering
- What happens: Prepend a special instruction that says, "Focus on narrative tension and emotions, not object lists."
- Why: It activates high-level reasoning and suppresses low-level enumeration.
- Example: Instead of "chair, lamp," the model encodes "anxious pause before confrontation."
Hook: Picture a thermometer for feelings that shows happy–sad and calm–excited.
The Concept (Latent Affective Probing to get VA): A tiny head reads the VLM's hidden states and outputs Valence–Arousal per second. How it works:
- Average spatial tokens per frame.
- Pass them through a small MLP to get VA in [-1, 1].
- Train only this head with an L1+L2 loss on an induced-emotion dataset.
Why it matters: Without a light probe, we'd need huge labeled datasets to train from scratch.
Anchor: A suspense build shows arousal rising over 20 seconds; the probe captures a smooth climb.
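A minimal sketch of the probing step, assuming mean-pooled frame tokens and a two-layer MLP whose tanh output keeps Valence and Arousal in [-1, 1]. The hidden sizes and random weights are stand-ins; in the paper, only this head is trained (with the L1+L2 loss) on top of the frozen VLM.

```python
import math, random

# Toy latent affective probe: mean-pool a frame's spatial tokens, then a
# small MLP with tanh output bounds Valence/Arousal to [-1, 1].
# HID and D are hypothetical sizes; weights are random stand-ins.
random.seed(0)
HID, D = 8, 4  # assumed VLM hidden size and probe width

W1 = [[random.uniform(-0.1, 0.1) for _ in range(HID)] for _ in range(D)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(2)]

def probe(frame_tokens):
    """frame_tokens: list of token vectors (each of length HID) for one frame."""
    # Mean-pool spatial tokens into a single frame feature.
    pooled = [sum(tok[i] for tok in frame_tokens) / len(frame_tokens)
              for i in range(HID)]
    h = [math.tanh(sum(w * x for w, x in zip(row, pooled))) for row in W1]
    # tanh keeps the two outputs (valence, arousal) inside [-1, 1].
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W2]

va = probe([[0.5] * HID, [0.2] * HID])
print(va)  # [valence, arousal]
```

Because the head is this small, it can be trained on a modest induced-emotion dataset while the VLM supplies all the narrative understanding.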
Step 2: Holistic Musical Conceptualization (Global Semantic Anchor)
- What happens: Ask the VLM to output a structured music-style card (genre, instruments/timbre, atmosphere, pacing) while avoiding visual nouns.
- Why: Keeps style coherent and aligned with the audio model's vocabulary.
- Example: "Neo-classical, piano + strings, solemn, slow–medium."
Hook: Like turning a dot-to-dot line into a smooth curve so the drawing looks natural.
The Concept (Temporal Super-resolution Adapter): It expands sparse VA points into a smooth, token-level control stream. How it works:
- Linearly interpolate VA from per-second to per-token.
- Apply dilated temporal convolutions to smooth jitter and widen context.
- Map the 2-D VA into the music model's hidden dimension.
Why it matters: Without it, music jumps abruptly at each second.
Anchor: Instead of sudden drum hits at 1-second marks, energy ramps in gracefully.
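The first two operations (upsample, then smooth) can be sketched as follows. A simple moving average stands in for the paper's dilated temporal convolutions, and the token rate is an assumption; the real adapter also learns the projection into the hidden dimension.

```python
# Sketch of temporal super-resolution for one VA channel: linear
# interpolation from per-second to per-token, then smoothing. A moving
# average is a stand-in for learned dilated temporal convolutions.
def upsample(values, tokens_per_second):
    out = []
    for a, b in zip(values, values[1:]):
        for k in range(tokens_per_second):
            out.append(a + (b - a) * k / tokens_per_second)
    out.append(values[-1])  # hold the final per-second value
    return out

def smooth(values, radius=2):
    n = len(values)
    return [sum(values[max(0, i - radius):min(n, i + radius + 1)]) /
            len(values[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]

arousal_per_sec = [0.2, 0.3, 0.6, 0.5]  # sparse per-second curve
dense = smooth(upsample(arousal_per_sec, 4))
print(len(dense), dense[:3])
```

The smoothed dense curve is what removes the "staircase" effect of per-second control, so energy ramps rather than jumps at each second boundary.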
Step 3: Hierarchical Acoustic Synthesis (Dual-Branch Injection)
- Global path: Feed the style anchor via existing cross-attention; this sets the genre/instrument feel.
- Local path: Inject dense affective control as a tiny additive bias only in shallow transformer blocks; this guides dynamics while preserving fidelity.
- Why: Early layers set the trajectory; deep layers polish sound and harmony.
- Example with data: If VA goes from (V = -0.2, A = 0.3) to (V = -0.1, A = 0.7), percussion tightens and the tempo feel lifts, but still within "ambient orchestral."
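The shallow-layer bias idea can be sketched with stand-in layers. The layer counts, the hand-rolled linear projection, and the identity "transformer layer" are all illustrative assumptions; the point is only that the VA bias touches the first few layers and nothing deeper.

```python
# Toy sketch of the dual-branch local path: the per-token VA pair is
# projected to the hidden size and added as a bias in shallow layers
# only; deeper layers run untouched so they can preserve fidelity.
NUM_LAYERS, SHALLOW, H = 6, 2, 4  # illustrative sizes

def project_va(va):
    """Stand-in linear projection from 2-D (valence, arousal) to hidden size."""
    v, a = va
    return [0.1 * v + 0.05 * a] * H

def layer(hidden):
    """Stand-in transformer block (identity here, for clarity)."""
    return hidden

def decode_token(hidden, va):
    for i in range(NUM_LAYERS):
        if i < SHALLOW:  # affective bias enters only the shallow layers
            bias = project_va(va)
            hidden = [h + b for h, b in zip(hidden, bias)]
        hidden = layer(hidden)
    return hidden

out = decode_token([0.0] * H, (-0.2, 0.7))
print(out)
```

Restricting the injection depth is the "surgical" part: the control signal steers the token trajectory early while the untouched deep layers keep the audio quality of the frozen backbone.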
Hook: Think of writing a long essay one page at a time, always looking back a bit so the story flows.
The Concept (Sliding-Window Inference): Generate long music by overlapping chunks that share context. How it works:
- Compute one global style from keyframes for the whole video.
- Process emotion and music in overlapping windows.
- Use the last tokens of the previous chunk to prompt the next for seamless continuation.
Why it matters: Without overlap and a global anchor, long tracks break or drift.
Anchor: A 6-minute vlog is scored as consistent "lo-fi chill" while adapting smoothly to each scene change.
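The windowing loop can be sketched with a dummy generator standing in for the frozen MusicGen decoder; window and overlap sizes are illustrative, and the dummy simply continues a token counter so the stitching logic is visible.

```python
# Sketch of sliding-window long-form generation: each new window is
# prompted with the tail of the previous window's tokens plus the one
# global anchor, so chunks continue seamlessly instead of restarting.
WINDOW, OVERLAP = 10, 3  # illustrative sizes

def generate_window(prompt_tokens, anchor, n):
    # Dummy generator: the real system would run the frozen music
    # decoder conditioned on `anchor` (cross-attention) and local VA.
    start = prompt_tokens[-1] + 1 if prompt_tokens else 0
    return list(range(start, start + n))

def generate_long(total_tokens, anchor="ambient orchestral"):
    tokens = generate_window([], anchor, WINDOW)
    while len(tokens) < total_tokens:
        context = tokens[-OVERLAP:]  # continuity prompt from the tail
        tokens.extend(generate_window(context, anchor, WINDOW))
    return tokens[:total_tokens]

track = generate_long(25)
print(len(track))
```

Because the anchor is computed once for the whole video, every window is steered toward the same style; only the continuity prompt and the local VA change from chunk to chunk.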
The Secret Sauce:
- Emotion as compressed control: simple but expressive.
- Minimal, surgical injections: nudge a frozen backbone instead of retraining it.
- Two-lane control (global + local): prevents style vs. dynamics conflict.
- Windowing with continuity prompts: scales to minutes without memory blowups.
04 Experiments & Results
The Test: The team measured how close the generated music distribution is to real music (FAD, FD, KLD), how well it covers variety (density/coverage), and cross-modal alignment (ImageBind). They also ran human studies on Emotional Dynamic Consistency (EDC), Global Style Matching (GSM), Long-term Coherence (LTC), Music Quality (MQ), and Overall Preference (OP).
The Competition: Strong baselines included VidMuse and GVMGen (recent video-to-music), Video2Music and M2UGEN (established approaches), and a strong two-stage Caption2Music pipeline (VLM captions → MusicGen).
Scoreboard with Context:
- FAD (lower is better): NarraScore 1.923 vs. VidMuse 2.459 and GVMGen 2.362; like getting the best audio "accent" match to real music.
- FD and KLD (lower is better): NarraScore FD 36.411, KLD 0.320; again best, meaning distributions are closer and less biased.
- ImageBind (higher is better): NarraScore 0.219, competitive with top methods, though this metric doesn't fully reflect temporal narrative.
- Human ratings (long-form): NarraScore tops EDC 2.86, GSM 3.02, LTC 3.15, MQ 3.41, OP 3.06; think of this as consistently earning A's where others hover around B's.
- Human ratings (short/mid): NarraScore still leads, with the margin growing on longer videos; evidence that the hierarchical design pays off as duration increases.
Surprising Findings:
- Text-only pipelines (Caption2Music) produce a decent global mood but miss finer emotional timing: good on short samples, weak on long arcs.
- MIDI-leaning baselines can sound polished yet feel generic and emotionally detached from the video.
- Ablations show Narrative-Aware Affective Reasoning (the VLM-driven VA probe) improves results even with very strong VLMs; explicit emotion curves still help.
- Too much local injection, or injecting into all layers, can hurt quality; it is best to guide early layers and let deeper ones preserve fidelity.
Takeaway: Numbers and listeners agree: NarraScore balances steady style with living, moment-to-moment emotion, especially on long videos where others drift or fragment.
05 Discussion & Limitations
Limitations:
- Temporal granularity: Per-second emotion can miss very fast visual beats (e.g., 200 ms cuts), so micro-synchrony isn't frame-perfect.
- Cascaded dependency: If the emotion reader is off, errors flow downstream into the music.
- Domain gaps: Datasets for induced emotion (viewer feelings) are still limited compared to face-expression datasets.
- Static style anchor: One global style may be too rigid for films with deliberate mid-story genre shifts.
Required Resources:
- A frozen VLM (e.g., VideoLLaMA family) and a frozen music decoder (e.g., MusicGen-Small), plus a small GPU for adapters.
- Datasets with continuous VA labels for both video (induced affect) and music (dynamic emotion).
When NOT to Use:
- Ultra-fast editing or beat-perfect synchronization needs (e.g., music videos with sub-second cuts).
- Projects requiring precise diegetic sound alignment (on-screen instrument performance) rather than background scoring.
- Cases where emotion labels are unreliable or the visual domain is far outside training (e.g., abstract animation with unusual symbolism).
Open Questions:
- Can end-to-end training reduce error propagation and improve timing?
- How to adaptively change the global anchor across scenes without style whiplash?
- Can we personalize affect (different audience segments) while keeping narrative fidelity?
- What are better temporal metrics for narrative alignment beyond static text-audio similarity?
- Can we extend the affect manifold beyond VA (e.g., tension, suspense, awe) while keeping control simple?
06 Conclusion & Future Work
Three-Sentence Summary: NarraScore treats emotion as a compact summary of story logic, using a frozen VLM to produce smooth Valence–Arousal curves and a structured global style anchor. These two signals are injected cleverly (global via cross-attention, local via a tiny token-wise adapter) into a frozen music model to yield long, coherent, and reactive soundtracks. The result outperforms strong baselines with minimal compute, especially on long videos where narrative flow matters most.
Main Achievement: Proving that frozen VLMs can act as high-quality "affective sensors" and that tiny, hierarchical controls can steer powerful music generators to follow evolving stories without retraining the backbone.
Future Directions: End-to-end optimization to tighten timing, adaptive multi-anchor styles for scene changes, richer affect spaces beyond VA, better temporal evaluation metrics, and efficient distillation to lower latency. Exploring user-in-the-loop adjustments (e.g., sketching custom emotion curves) could add creative control.
Why Remember This: It reframes video scoring as emotion-driven control, not frame matching: simple curves, big impact. By separating global identity from local dynamics and nudging rather than rebuilding, NarraScore makes professional, narrative-aware scoring practical for everyone from filmmakers to students.
Practical Applications
- Auto-score YouTube vlogs so music quietly lifts during key moments without manual editing.
- Generate documentary soundtracks that keep a consistent style while following interview intensity.
- Create classroom videos where music swells during big ideas to aid memory and attention.
- Provide indie filmmakers with affordable, narrative-aware temp tracks for editing and screening.
- Score game cutscenes that adapt to tension, victory, or loss without hand-authored music logic.
- Assist podcast-to-video adaptations with gentle background music that matches topic shifts.
- Prototype ad and trailer music that rises and falls with storytelling beats for better impact.
- Enable accessibility tools that sonify emotion changes for visually impaired audiences.
- Support interactive museum exhibits where music follows visitor pacing and scene focus.
- Offer creators a simple "draw your emotion curve" interface to steer custom scores.