InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Intermediate
Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen et al. · 3/4/2026
arXiv

Key Summary

  • InfinityStory is a new system that can make very long videos (even hours long) where the world stays the same and characters transition smoothly between shots.
  • It fixes a big problem in video generation: backgrounds used to drift and change, but InfinityStory locks each scene to a specific location so it looks stable.
  • The pipeline plans stories with multiple smart helpers (agents) that split a script into chapters, locations, scenes, and shots.
  • Odd-numbered shots are generated from a keyframe image (I2V), and even-numbered shots are transition clips generated from the last frame of the previous shot and the keyframe of the next one (FLF2V).
  • They built a purpose-built dataset of 10,000 short videos showing characters entering, exiting, and swapping so the model learns smooth multi-subject transitions.
  • On the VBench benchmark, it scored highest for Background Consistency (88.94) and Subject Consistency (82.11), with the best overall average rank (2.80).
  • Human studies also preferred InfinityStory for background stability, motion smoothness, and overall quality over other methods.
  • The system currently runs at 480p, which slightly lowers image aesthetics compared to higher-resolution models, but the consistency benefits are strong.
  • Its biggest limitation is generalizing transitions to brand-new character mixes and complex plots, which the authors plan to improve with more diverse data and supervision.

Why This Research Matters

Long videos for education, entertainment, and communication need steady places and natural handoffs, or viewers quickly feel lost. InfinityStory shows how to engineer that steadiness: lock each scene’s background first, then animate people moving in and out with trained transitions. This makes AI-made content watchable for kids’ shows, explainer lessons, and indie films without expensive filming. Brands and creators can keep characters on-model across episodes, improving trust and continuity. It can reduce reshoots and manual editing by automating the hardest parts—consistency and transitions. As resolution and data grow, the same blueprint can scale to truly cinematic quality. In short, it’s a practical step toward AI that tells stories like a real movie crew.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) Imagine you're making a school play on video. The stage, props, and classroom should stay the same across scenes, and actors should enter and leave smoothly. If the stage keeps changing or actors pop in and out magically, the story feels broken.

🥬 Filling (The Actual Concept): Video Generation Pipeline

  • What it is: A step-by-step process that turns a written story and character pictures into a full video.
  • How it works: Like a factory line—plan the story, make background images, insert characters, generate main shots, then generate transition shots, and finally stitch everything together.
  • Why it matters: Without a clear pipeline, long videos become messy; scenes drift, and characters look inconsistent.

🍞 Bottom Bread (Anchor) Think of baking cookies: mix dough, cut shapes, bake, cool, and decorate. A video pipeline follows its own fixed recipe to get a reliable final video.
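
To make the factory-line idea concrete, here is a minimal Python sketch of the stage ordering described above. The stage names and the driver function are illustrative placeholders, not the authors' actual code.

```python
# Illustrative stage ordering only; the stage names and driver are placeholders.
PIPELINE_STAGES = [
    "plan_story",                 # agents split the script into chapters, locations, scenes, shots
    "generate_backgrounds",       # one clean background image per location
    "compose_keyframes",          # paste characters onto the matching background
    "generate_narrative_shots",   # odd shots: image-to-video from keyframes
    "generate_transition_shots",  # even shots: first-last-frame-to-video bridges
    "stitch_shots",               # concatenate everything in order
]

def run_pipeline(story_text: str, character_refs: dict) -> str:
    """Placeholder driver: walk the stages in their fixed order."""
    for stage in PIPELINE_STAGES:
        print(f"running {stage} for a {len(story_text)}-character story "
              f"with {len(character_refs)} characters")
    return "final_video.mp4"  # placeholder output path

run_pipeline("Two friends explore a castle and meet a guard.", {"Ava": "ava.png", "Ben": "ben.png"})
```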

🍞 Top Bread (Hook) You know how your classroom walls and desks are always in the same place day after day? That helps you feel grounded.

🥬 Filling (The Actual Concept): Background Consistency

  • What it is: Keeping the setting (like a forest or a bedroom) visually the same across all shots in a scene.
  • How it works: The system makes a library of background locations first, then reuses the matching background every time that scene appears. For each shot, it fuses the characters onto the same background before making the video.
  • Why it matters: If the background drifts (sofa moves, window flips, lighting shifts), viewers get confused and lose the story thread.

🍞 Bottom Bread (Anchor) In a living-room scene, the same couch, rug, and window keep their look and position across every shot in that scene.
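
Here is a tiny sketch of the "reuse the same background" idea, assuming backgrounds live in a simple lookup keyed by location name. The function and file paths are hypothetical.

```python
# Hypothetical lookup: one canonical background per named location.
location_library: dict[str, str] = {}

def get_background(location: str) -> str:
    """Return the same background image every time a scene uses this location."""
    if location not in location_library:
        # In the real system this would be generated once by a text-to-image model,
        # then frozen and reused; here it is just a placeholder file path.
        location_library[location] = f"backgrounds/{location.replace(' ', '_')}.png"
    return location_library[location]

# Every shot set in the living room starts from the identical backdrop:
assert get_background("Living Room") == get_background("Living Room")
```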

🍞 Top Bread (Hook) Imagine directing a class project with four helpers: one splits the story into chapters, another chooses places, another plans scenes, and another decides camera shots.

🥬 Filling (The Actual Concept): Multi-Agent System

  • What it is: Several specialized AI helpers (agents) that plan the story from big pieces (chapters) down to tiny details (shots).
  • How it works: Chapter Agent makes plot beats, Location Agent builds a location library, Scene Agent maps scenes to a single location, Shot Agent writes exact shot instructions and transition rules.
  • Why it matters: Without coordinated helpers, the plan breaks—backgrounds change accidentally and transitions feel jarring.

🍞 Bottom Bread (Anchor) Like a movie crew: the screenwriter, location scout, scene director, and cinematographer work together so the film feels cohesive.
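
A minimal sketch of the planning hierarchy the four agents hand down, written as simple Python dataclasses. The field names are assumptions for illustration; the paper does not publish an exact schema here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    index: int
    kind: str              # "I2V" (odd, narrative) or "FLF2V" (even, transition)
    characters: List[str]
    directive: str         # camera move, emotions, dialogue, or who enters/exits

@dataclass
class Scene:
    location: str          # exactly one location from the library
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Chapter:
    summary: str
    scenes: List[Scene] = field(default_factory=list)

# Chapter Agent -> Chapter, Location Agent -> library entries,
# Scene Agent -> Scene (bound to one location), Shot Agent -> Shot directives.
plan = [Chapter(summary="The friends reach the castle",
                scenes=[Scene(location="Castle Hall",
                              shots=[Shot(1, "I2V", ["Ava", "Ben"], "talk near the throne")])])]
```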

🍞 Top Bread (Hook) Think of setting up a stage before actors arrive: you put the backdrop in place and then let actors perform.

🥬 Filling (The Actual Concept): Location-Grounded Background Injection

  • What it is: A method to lock a scene to a pre-made background before adding characters.
  • How it works: First make a clean background image for each location; then create a keyframe that pastes the character(s) into that background; then generate the video from that keyframe.
  • Why it matters: Without “injecting” the same background, each new shot might redraw the room slightly differently.

🍞 Bottom Bread (Anchor) The “Castle” scene always starts from the same castle background image, so the throne and stained-glass windows never wander.
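
A small sketch of the injection step: the keyframe is always composed on top of the scene's frozen background before any video is generated. `compose_characters`, `make_keyframe`, and the file paths are hypothetical stand-ins for the image-to-image compositor, not real APIs.

```python
def compose_characters(background_path: str, character_refs: list, layout_prompt: str) -> str:
    """Stand-in for an image-to-image compositor: paste reference characters onto
    the canonical background and return the resulting keyframe path."""
    return f"keyframes/{layout_prompt.replace(' ', '_')[:30]}.png"

def make_keyframe(location: str, character_refs: list, layout_prompt: str) -> str:
    # The background file is the same every time this location appears,
    # so the room cannot be redrawn differently from shot to shot.
    background = f"backgrounds/{location.replace(' ', '_')}.png"
    return compose_characters(background, character_refs, layout_prompt)

print(make_keyframe("Castle Hall", ["ava.png", "ben.png"], "Ava and Ben stand near the throne"))
```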

🍞 Top Bread (Hook) Imagine you have a starting photo and an ending photo of a magic trick, and you need to fill in all the steps in between.

🥬 Filling (The Actual Concept): First-Last-Frame-to-Video (FLF2V)

  • What it is: A model that takes the last frame of the previous shot and the keyframe of the next shot, and generates a smooth in-between transition video.
  • How it works: It reads the “from” frame and the “to” keyframe, plus rules about who’s entering or exiting, and then draws all the in-between frames.
  • Why it matters: Without FLF2V, clips just cut hard; characters may pop in/out and movement looks choppy.

🍞 Bottom Bread (Anchor) End of Shot A shows an empty doorway; start of Shot B’s keyframe shows a character inside. FLF2V animates the character naturally walking through the door between them.
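
A sketch of what an FLF2V call could look like: two anchor images and a transition instruction in, a list of in-between frames out. This is an illustrative stub, not the model's real interface.

```python
def flf2v(first_frame: str, last_frame: str, transition_prompt: str, num_frames: int = 81) -> list:
    """Illustrative stub: return frame paths that bridge `first_frame` to `last_frame`
    while following the transition instruction (who enters/exits, camera move)."""
    middle = [f"transition_frame_{i:03d}.png" for i in range(1, num_frames - 1)]
    return [first_frame] + middle + [last_frame]

clip = flf2v(
    first_frame="shotA_last_frame.png",   # empty doorway at the end of Shot A
    last_frame="shotB_keyframe.png",      # character already inside in Shot B's keyframe
    transition_prompt="The character walks in through the door from screen left.",
)
print(len(clip), "frames")
```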

🍞 Top Bread (Hook) Think of a relay race: one runner slows down and passes the baton right as the next runner speeds up—smooth handoff!

🥬 Filling (The Actual Concept): Cinematic Multi-Subject Transition Synthesis (CMTS)

  • What it is: A way to create transitions where multiple characters can enter, exit, or switch places smoothly between shots.
  • How it works: The system encodes who’s in the previous shot, who appears at the beginning and end of the transition, who enters/exits, and what movement type it is; then it uses FLF2V trained on such cases to animate it.
  • Why it matters: Without CMTS, characters appear or vanish suddenly, breaking the movie feel.

🍞 Bottom Bread (Anchor) Two friends on a bench (Shot A) become three friends (next shot’s keyframe); the transition shows the third friend walking in from off-screen and sitting down, not teleporting.
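
One possible way to encode the transition plan before handing it to the transition model, based on the entry/exit/swap categories described above. The exact schema and labels are assumptions for illustration.

```python
# Hypothetical schema: who is present before/after, who moves, and how.
transition_metadata = {
    "subjects_before": ["Mia", "Noah"],          # on the bench at the end of Shot A
    "subjects_after": ["Mia", "Noah", "Liam"],   # in the next shot's keyframe
    "entering": ["Liam"],                        # walks in from off-screen
    "exiting": [],
    "movement_type": "entry",                    # entry / exit / swap
    "camera": "slight pan right",
}

def transition_prompt(meta: dict) -> str:
    """Flatten the metadata into a text instruction for the transition model."""
    parts = []
    if meta["entering"]:
        parts.append(f"{', '.join(meta['entering'])} enters the frame")
    if meta["exiting"]:
        parts.append(f"{', '.join(meta['exiting'])} exits the frame")
    parts.append(f"camera: {meta['camera']}")
    return "; ".join(parts)

print(transition_prompt(transition_metadata))  # "Liam enters the frame; camera: slight pan right"
```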

🍞 Top Bread (Hook) Sports teams do scrimmages to practice tricky plays before the real game.

🥬 Filling (The Actual Concept): Synthetic Dataset for Multi-Character Transitions

  • What it is: A large collection of short, made-up transition videos focused on entry, exit, and replacements.
  • How it works: Agents write diverse prompts; a video model makes clips; a vision-language model filters out bad ones; the final set trains FLF2V.
  • Why it matters: Without practice data showing multi-person transitions, the model never learns smooth handoffs.

🍞 Bottom Bread (Anchor) They generated 10,000 transition clips and kept about 3,980 high-quality examples to teach the model realistic entrances and exits.
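
A minimal sketch of the generate-then-filter loop, with placeholder stubs standing in for the prompt-writing agents, the video generator, and the vision-language quality filter.

```python
def write_transition_prompts(n: int) -> list:
    """Stand-in for the prompt-writing agents (entries, exits, swaps)."""
    templates = ["one person enters an empty room",
                 "two people leave the frame together",
                 "a third friend joins two people on a bench"]
    return [f"{templates[i % len(templates)]} (variation {i})" for i in range(n)]

def render_clip(prompt: str) -> str:
    return f"clips/{abs(hash(prompt)) % 100_000}.mp4"        # placeholder clip path

def vlm_quality_score(clip_path: str) -> float:
    return (abs(hash(clip_path)) % 100) / 100                # placeholder score in [0, 1]

KEEP_THRESHOLD = 0.7
dataset = []
for prompt in write_transition_prompts(10_000):
    clip = render_clip(prompt)
    if vlm_quality_score(clip) >= KEEP_THRESHOLD:            # the paper keeps ~3,980 of 10,000
        dataset.append((clip, prompt))
print(f"kept {len(dataset)} of 10,000 candidate clips")
```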

Before this work, long videos made by AI often looked like a patchwork of unrelated short clips—backgrounds kept morphing and characters popped in and out at cuts. People tried better prompts or longer attention windows, but these didn’t firmly anchor the background or control multi-character transitions. InfinityStory fills that gap by grounding each scene in a fixed location and teaching transitions with explicit entry/exit logic. That matters for anyone who wants AI to make watchable, story-driven videos—education videos, kid shows, explainer films, or indie projects—because consistency and smoothness are what make a story feel real.

02Core Idea

🍞 Top Bread (Hook) You know how a Lego castle stays sturdy if you lock the base plates first and then snap figures on top? Building the base first keeps everything steady while you tell your story.

🥬 Filling (The Actual Concept): The “Aha!” Moment

  • What it is: Lock the world first (fixed locations), then choreograph people moving through it (trained transitions) so long videos stay consistent and feel cinematic.
  • How it works: A planning team of AI agents maps the story into chapters → locations → scenes → shots, then every scene uses the same background with characters injected, and transitions are generated by a model that understands who enters/exits.
  • Why it matters: Without a locked world and smart handoffs, long videos wobble—rooms drift and characters teleport.

🍞 Bottom Bread (Anchor) A 10-minute scene in a café keeps the same tables and windows, while patrons come and go smoothly between shots.

Three analogies to see it clearly:

  1. Stage play analogy: First, hang the same backdrop for the whole act; second, direct the actors to enter and exit on cue. Result: consistency plus flow.
  2. Map-and-route analogy: Fix the map (backgrounds), then draw routes (transitions) that show how people move from point A to B.
  3. Comic-to-animation analogy: Fix the panel’s setting, then animate character movements between panels so nothing teleports.

Before vs. After:

  • Before: Each clip acted like its own mini-world—walls, lights, and props shifted; transitions were hard cuts where people popped in or out.
  • After: Scenes reuse the same background; characters appear or exit with actual motion; shots link together like a real movie.

Why it works (intuition, no equations):

  • Decoupling: Backgrounds are handled once per scene, so variations don’t accumulate.
  • Anchoring: A composited keyframe nails character identity in the right place against the fixed background.
  • Bridging: FLF2V transitions know both endpoints (last frame and next keyframe) and the transition plan (who enters/exits), so the model draws believable in-betweens.
  • Planning: Agents enforce constraints (e.g., scenes must bind to a single location and have alternating shot types), which prevents chaos.

Building Blocks (as tiny pieces): 🍞 Hook Think of making a sandwich: bread first (world), filling next (people moving), then slice it cleanly (edits).

🥬 Piece 1: Story Decomposition (Agents)

  • What it is: Split story into chapters, locations, scenes, and shots with rules.
  • How it works: Four agents create structure and metadata (who’s in which shot, which location is used, what camera move, etc.).
  • Why it matters: Structure avoids drifting settings and inconsistent casting.

🍞 Anchor Like a teacher’s lesson plan broken into units, topics, activities, and tasks.
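
A small sketch of the structural rules the planner enforces, assuming each scene is represented as a plain dictionary. The rule set mirrors the constraints described in the methodology below: one library location per scene, an odd shot count, and alternating I2V/FLF2V shot types.

```python
def validate_scene(scene: dict, location_library: set) -> list:
    """Return a list of constraint violations for one planned scene."""
    errors = []
    if scene["location"] not in location_library:
        errors.append("scene must bind to a location from the library")
    shots = scene["shots"]
    if len(shots) % 2 == 0:
        errors.append("scene must contain an odd number of shots")
    for i, shot in enumerate(shots, start=1):
        expected = "I2V" if i % 2 == 1 else "FLF2V"
        if shot["kind"] != expected:
            errors.append(f"shot {i} should be {expected}, got {shot['kind']}")
    return errors

scene = {"location": "Castle Hall",
         "shots": [{"kind": "I2V"}, {"kind": "FLF2V"}, {"kind": "I2V"}]}
print(validate_scene(scene, {"Castle Hall", "Courtyard", "Dungeon"}))  # -> [] (valid plan)
```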

🥬 Piece 2: Location Library and Background Injection

  • What it is: Make a set of reusable backgrounds and always reuse the right one.
  • How it works: Generate a clean background per location; compose keyframes with characters on top.
  • Why it matters: It freezes the world so it won’t slide around.

🍞 Anchor Every time the story goes to “Forest,” it’s the same forest image layout.

🥬 Piece 3: Odd Shots = I2V from Keyframes

  • What it is: Main narrative shots start from a keyframe.
  • How it works: Create a keyframe with characters on the set, then animate it into a shot.
  • Why it matters: It keeps character looks and positions steady from the start.

🍞 Anchor Keyframe: the hero at the café counter; the I2V shot animates them ordering.

🥬 Piece 4: Even Shots = FLF2V Transitions

  • What it is: Bridging shots between two anchors.
  • How it works: Use the last frame of the previous shot and the next keyframe, plus who’s entering/exiting, to animate the handoff.
  • Why it matters: Without it, you get jumpy cuts and teleporting actors.

🍞 Anchor From empty doorway (end of Shot A) to hero inside (next keyframe), the transition shows the walk-in.
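
A tiny sketch of why every even shot is well anchored: each transition slot is defined by the last frame of the previous narrative shot and the keyframe of the next one. Names and paths are illustrative.

```python
def transition_anchors(narrative_clips: list, keyframes: list) -> list:
    """Pair each even (transition) slot with its two anchors:
    the last frame of the previous narrative shot and the next shot's keyframe."""
    pairs = []
    for k in range(1, len(keyframes)):
        pairs.append({
            "from_frame": f"{narrative_clips[k - 1]}:last_frame",  # end of previous odd shot
            "to_keyframe": keyframes[k],                           # start of next odd shot
        })
    return pairs

# 3 narrative shots -> 2 transitions (Shots 2 and 4), each anchored at both ends.
print(transition_anchors(["shot1.mp4", "shot3.mp4", "shot5.mp4"],
                         ["kf_shot1.png", "kf_shot3.png", "kf_shot5.png"]))
```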

🥬 Piece 5: Multi-Subject Transition Training (CMTS)

  • What it is: A dataset and training recipe focused on entries (0→X), exits (X→0), and character swaps between shots.
  • How it works: Generate and filter thousands of examples so the model learns multi-person choreography.
  • Why it matters: Real scenes often have more than one person moving—this teaches the model to handle that.

🍞 Anchor Two people talking become three; the third doesn’t teleport—he walks in on camera.

03Methodology

At a high level: Story text + character images → Multi-agent planning (chapters → locations → scenes → shots) → Location backgrounds + character keyframes (injection) → Odd shots (I2V) + Even shots (FLF2V with transition metadata) → Stitch shots into a long, consistent video.

Step-by-step, like a recipe (an end-to-end code sketch follows this list):

  1. Input collection
  • What happens: You provide a short story description, plus reference images and names for each character.
  • Why this step exists: The system needs both words (plot) and pictures (who the characters are) to keep identities stable.
  • Example: Story: “Two friends explore a castle and meet a guard.” Characters: Ava (photo refs), Ben (photo refs), Guard (photo refs).
  2. Multi-agent planning (chapters → locations → scenes → shots)
  • What happens: Four AI agents structure the story:
    • Chapter Agent: splits plot into chapters with who/when/why.
    • Location Agent: builds a library of clean, character-free backgrounds (e.g., Castle, Courtyard, Dungeon) and their descriptions.
    • Scene Agent: expands chapters into scenes and locks each scene to exactly one location from the library; enforces each scene has an odd number of shots (so we alternate shot types consistently).
    • Shot Agent: writes detailed shot directives, including who’s in the shot, emotions, poses, dialogue, camera moves, keyframes for odd shots, and transition metadata for even shots.
  • Why this step exists: Without strict planning, backgrounds and characters drift; transitions lose logic.
  • Example: Scene 3 is bound to “Castle Hall.” It has 5 shots: Shot 1 (I2V), Shot 2 (FLF2V), Shot 3 (I2V), Shot 4 (FLF2V), Shot 5 (I2V). Shot 2’s metadata: “Ava enters from left; Ben stays; movement=Entry.”
  3. Generate a background library (one per location)
  • What happens: A text-to-image model creates a clean, detailed background image for each location name+description (no characters).
  • Why this step exists: Reusing the same background image locks the world for that scene.
  • Example: For “Castle Hall,” generate a throne room with red banners, stone arches, and a skylight—saved as the canonical background.
  4. Keyframe composition (background injection)
  • What happens: For each odd (narrative) shot, an image-to-image model composites the background with the character reference images to produce a keyframe that embeds identity and layout.
  • Why this step exists: Starting the shot from a stable keyframe prevents the model from redrawing the room differently or changing a character’s face.
  • Example: Shot 1 keyframe: Ava and Ben stand near the throne, Ava smiling, Ben looking curious.
  5. Generate odd shots with I2V (from keyframes)
  • What happens: An image-to-video model turns each keyframe into a 5-second narrative video shot, using the shot’s text instructions for motion and camera.
  • Why this step exists: This is where the main storytelling motion lives (dialogue, gestures, camera pans) while inheriting the fixed background.
  • Example: Shot 1 animates Ava pointing to a banner while the camera slowly pushes in.
  6. Generate even shots with FLF2V (transitions)
  • What happens: For each even shot, the system takes the last frame of the previous shot and the keyframe of the next odd shot, plus the transition metadata (who enters/exits and how), and generates a smooth bridging clip.
  • Why this step exists: This is the secret to cinematic, non-jumpy handoffs—especially when people appear or leave.
  • Example: Shot 2: From end of Shot 1 (Ava and Ben alone) to keyframe of Shot 3 (Ava, Ben, and the Guard), the model shows the Guard walking in from the right while the camera gently reframes.
  7. Cross-shot memory and stitching
  • What happens: A simple memory tracks who appeared and the rough layout, helping the next shot stay aligned; then all shots are stitched in order.
  • Why this step exists: Memory reduces identity and layout drift over time; stitching makes the final movie.
  • Example: The throne stays centered across shots; Ava’s hairstyle and outfit remain the same.
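
Putting the recipe together, here is a minimal end-to-end sketch on the Castle Hall example. Every model call (text-to-image, compositing, I2V, FLF2V, stitching) is replaced by a named stub, so this shows only the data flow, not real model APIs.

```python
# Every model call below is a named stub so the data flow is visible without real APIs.
def generate_background(location_desc): return f"bg_{location_desc.split(',')[0].strip().replace(' ', '_')}.png"
def compose_keyframe(background, chars, pose): return f"kf_{'_'.join(chars)}_{pose.split()[0]}.png"
def i2v(keyframe, prompt): return f"clip_{keyframe}.mp4"
def flf2v(prev_last_frame, next_keyframe, meta): return f"bridge_to_{next_keyframe}.mp4"
def last_frame(clip): return f"{clip}.last.png"
def stitch(clips): return "scene.mp4"

# Steps 2-3: planning output and the canonical background for this scene
background = generate_background("Castle Hall, stone arches, warm torchlight, red banners, no people")

# Step 4: keyframes for the odd (narrative) shots only
keyframes = [
    compose_keyframe(background, ["Ava", "Ben"], "near the throne"),              # Shot 1
    compose_keyframe(background, ["Ava", "Ben", "Guard"], "leaning over a map"),  # Shot 3
    compose_keyframe(background, ["Ava", "Ben"], "continuing alone"),             # Shot 5
]
narrative_prompts = ["Ava points to a banner, camera push-in",
                     "The three lean over a map on a pedestal",
                     "Ava and Ben keep talking"]
transition_meta = ["Entry: Guard from the right edge", "Exit: Guard to the left"]

# Steps 5-7: alternate I2V (odd) and FLF2V (even) shots, then stitch them in order
clips = []
for k, kf in enumerate(keyframes):
    if k > 0:
        clips.append(flf2v(last_frame(clips[-1]), kf, transition_meta[k - 1]))  # even shot
    clips.append(i2v(kf, narrative_prompts[k]))                                 # odd shot
print(stitch(clips))  # 5 shots: I2V, FLF2V, I2V, FLF2V, I2V -> scene.mp4
```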

The Secret Sauce (what’s clever):

  • Location-grounded background injection decouples world-building from character motion, stopping background drift before it starts.
  • Alternating shot types (odd=I2V, even=FLF2V) ensures each transition has two anchors (last frame + next keyframe) and explicit choreography (enter/exit), so no one teleports.
  • A purpose-built synthetic dataset for multi-subject transitions teaches the model rare but crucial cases: zero-to-many, many-to-zero, and replacements.
  • Agentic planning enforces constraints (one location per scene, odd shot counts, valid character sets), turning a chaotic task into a controlled pipeline.

Concrete mini example (5-shot scene):

  • Shot 1 (I2V): Keyframe with Ava+Ben in Castle Hall → video of them talking.
  • Shot 2 (FLF2V): Last frame of Shot 1 + keyframe of Shot 3 (with Guard) + metadata (Guard enters from right) → smooth walk-in.
  • Shot 3 (I2V): Ava, Ben, Guard discuss a map.
  • Shot 4 (FLF2V): Last frame of Shot 3 + keyframe of Shot 5 (Guard gone) + metadata (Guard exits left) → smooth exit.
  • Shot 5 (I2V): Ava and Ben continue alone; background still the same Castle Hall.

Example with actual data (simplified prompts):

  • Location background: “Castle Hall, stone arches, warm torchlight, red banners, polished floor, no people.”
  • Shot 1 prompt: “Ava (smiling) and Ben (curious) near throne, camera push-in, 5s, soft hand gestures, whispering.”
  • Shot 2 metadata: “Entry: Guard from right edge; Ava+Ben stay; subtle camera pan right.”
  • Shot 3 prompt: “Three lean over a map on a pedestal, camera medium shot, 5s.”

04Experiments & Results

The Test: The team measured how steady the background and characters stayed, how smooth the motion looked, how good the images appeared, and the overall performance rank on a respected benchmark (VBench). These matter because good stories need stable places, recognizable people, and motion that feels natural—otherwise, your brain notices the glitches.

The Competition: InfinityStory was compared to pipelines built from popular models (like Stable Diffusion + CogVideo or Wan), planning systems like MovieAgent, and structured multi-shot methods like Video-Gen-of-Thought and StoryAdapter combos.

The Scoreboard (with context):

  • Background Consistency: 88.94 (best). Think of this like scoring the highest in keeping the stage identical—like getting an A+ when others mostly got B’s.
  • Subject Consistency: 82.11 (best). Characters keep their look—hair, clothes, and face stay recognizable across shots.
  • Average Rank: 2.80 (best overall across metrics). This is like winning the decathlon: not just good at one event but strong across the board.
  • Image Quality and Aesthetic: Slightly lower than some baselines, partly because the system currently runs at 480p and image edits can add small artifacts. That’s like turning in a great essay printed on lower-resolution paper—the ideas and structure are excellent, even if the print isn’t the crispest yet.
  • Motion Smoothness: Competitive with top baselines. Despite focusing on consistency, InfinityStory still keeps motion feeling natural.

Human Study Highlights: In a user study with 20 participants comparing three systems per story, InfinityStory was preferred for background consistency, smooth transitions, motion smoothness, and overall image quality. That means actual viewers noticed the difference, not just algorithms.

Surprising Findings:

  • Removing background injection hurt both subject and background consistency. So, fixing the world helps people too—when the room is stable, the characters’ look drifts less.
  • Turning off multi-character transition training reduced motion smoothness and aesthetics. Teaching the model how people come and go made the entire video feel more polished.
  • Dynamic Degree (a motion magnitude score) was moderate for InfinityStory—not the most active, not the stillest—fitting its goal: prioritize coherent storytelling over wild motion.

Takeaway of the numbers: InfinityStory wins where viewers feel it most—consistent places and characters with transitions that feel like a real movie. Even when some baselines may look slightly sharper frame-by-frame at higher resolutions, the overall story flow and coherence are stronger in InfinityStory.

05Discussion & Limitations

Limitations:

  • Generalization of transitions: The FLF2V transition model can struggle with totally new character mixes or very complex storylines it hasn’t seen. If your scene choreography is unusual, results may be less smooth.
  • Resolution and aesthetics: Running at 480p and using image edits can slightly lower aesthetic scores compared to 720p-native generators.
  • Style coverage: The synthetic data focused on anime/cartoon/3D styles for transitions; photorealistic multi-person transitions may need more targeted data.
  • Timing variety: While transitions are smoother, perfect timing of entrances/exits (speed, micro-pauses) still benefits from more diverse supervision.

Required Resources:

  • A text-to-image model for background creation, an image-to-image compositor for character injection, an image-to-video model for narrative shots, and an FLF2V model for transitions (plus compute to run them sequentially).
  • A multi-agent planning setup (LLMs) to produce structured chapters, locations, scenes, and shots.
  • Storage for background libraries, keyframes, and generated clips for long sequences.

When NOT to Use:

  • If you need ultra-high-resolution cinematic footage (e.g., 4K ads) with perfect photorealism today—InfinityStory prioritizes consistency and structure over max resolution.
  • If your content is purely single-shot or micro-clips—simpler, single-shot models may be faster and sufficient.
  • If you want chaotic, wildly changing backgrounds on purpose—this system deliberately prevents that drift.

Open Questions:

  • How to scale FLF2V to handle complex ensembles (4+ characters) with overlapping entrances/exits in photorealistic style?
  • Can we adaptively decide shot lengths and transition durations from story pacing, not just fixed 5s shots?
  • How to automatically correct identity drift in mid-scene without manual references—e.g., learn a stronger identity memory?
  • What are the best metrics for narrative coherence beyond current VBench scores—how do we measure “story sense” automatically?

06Conclusion & Future Work

Three-Sentence Summary: InfinityStory builds long, story-driven videos by first locking each scene to a fixed background and then training a transition model that understands who enters, exits, and moves between shots. A multi-agent planner structures the script into chapters, locations, scenes, and shots so the pipeline stays organized and consistent. The result is smoother, more coherent videos that score best on background and character consistency benchmarks.

Main Achievement: It introduces a practical, scalable recipe—location-grounded background injection plus multi-subject transition synthesis—showing that explicit world locking and supervised handoffs can finally make long-form AI video feel cinematic.

Future Directions: Improve generalization with more diverse transition data (styles, character mixes, complex choreography), add richer supervision signals (multi-prompt or motion labels), and raise visual fidelity by increasing native resolution and refining edit modules. Better narrative metrics and adaptive pacing could further boost story quality.

Why Remember This: This work flips the script: instead of hoping prompts keep everything steady, it engineers steadiness—first by fixing the world, then by teaching how people move through it. That’s a blueprint others can reuse to make long AI videos that actually feel like movies, not stitched-together clips.

Practical Applications

  • Create classroom explainer videos where the set (like a lab or map) stays identical across lessons for easier comprehension.
  • Produce children’s animated stories with recurring characters and steady environments across many episodes.
  • Generate storyboard-to-animatic drafts for filmmakers with smooth multi-character entrances and exits between shots.
  • Build brand mascot videos that keep the character’s look identical across campaigns while changing scenes smoothly.
  • Design training simulations (e.g., first aid, safety drills) with consistent rooms and realistic personnel movement.
  • Assemble interactive museum or exhibit videos where visitors notice continuity across long narratives.
  • Prototype narrative video games’ cutscenes quickly with consistent locations and believable cast transitions.
  • Automate social media serials (multi-part reels) where background and character identity remain stable over weeks.
  • Create dubbed localizations by regenerating shots with the same scene layout but updated dialogue timing and lip cues.
  • Support pre-visualization for stage plays by keeping the set fixed while blocking actor entrances and exits.
#long-form video generation #background consistency #multi-agent planning #keyframe-based I2V #FLF2V transitions #multi-subject transitions #synthetic transition dataset #location-grounded injection #story visualization #temporal coherence #character identity preservation #cinematic shot planning #VBench evaluation #agentic pipelines