Beyond Language Modeling: An Exploration of Multimodal Pretraining
Key Summary
- The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.
- A single semantic visual encoder (RAE-style, e.g., SigLIP 2) works best for both understanding images and generating them, so you don't need two different visual pathways.
- Adding visual data (especially video) does not hurt language ability and often helps; image–text pairs are essential for visual skills.
- World modeling (predicting what you'll see next after an action) largely emerges from general multimodal pretraining and needs only a tiny bit of navigation-specific data.
- Mixture-of-Experts (MoE) lets the model assign different specialist sub-networks to different tokens, naturally separating vision and language capacity.
- Vision is more data-hungry than language; scaling laws show you need more visual tokens as compute grows, but MoE helps balance this asymmetry.
- MoE with fine-grained experts and per-modality shared experts yields better results than hand-crafted separation or dense models at the same compute.
- Recaptioned data improves understanding while high-aesthetic data boosts generation; using both sources together works best.
- Representing actions as plain text (like 'dx=+1.3') allows the same model to plan and predict future views with no architecture changes.
- Overall, the paper provides a clear recipe: use RAE semantic latents, embrace diverse multimodal data, and scale with MoE to build truly unified models.
Why This Research Matters
Unified multimodal models that both read and see can better understand real-life tasks: they connect words to what the world looks like and how it changes. With a single semantic visual representation, systems become simpler, faster to iterate, and more reliable across tasks. Synergy between modalities means you don't need massive task-specific datasets: diverse pretraining already provides a strong foundation. World modeling emerging from general training opens doors for safer navigation, assistive robotics, and AR/VR guidance. MoE makes large models more efficient, delivering specialization without extra per-token compute. The scaling-law insights help teams plan data and compute budgets wisely, especially for video-heavy applications. Ultimately, this brings AI closer to grounded reasoning: understanding not just text, but the physics and dynamics of the world.
Detailed Explanation
01 Background & Problem Definition
You know how reading about a place in a book is helpful, but actually visiting it teaches you more? For years, AI models mostly learned from text, which is like reading descriptions. That created amazing language models that can write, summarize, and reason, but they rarely "saw" the real world's sights and motions. Text is powerful, but it's also a compressed shadow of reality: it misses physics, geometry, and the constant flow of what happens next in videos. Meanwhile, the internet has a limited amount of top-quality text, and we're getting close to using most of it. In contrast, the visual world (images and especially videos) offers a nearly endless stream of rich signals about how things look, move, and interact.
The problem was that building models that learn natively from both language and vision (multimodal models) has been confusing. Many recent systems start from a pretrained language model and then bolt on vision later. That makes it hard to tell which abilities came from the original language training and which truly came from learning vision+language together. People also thought you needed two separate visual representations: one (VAE) for generating images and another (semantic encoders like SigLIP or DINOv2) for understanding them. This dual setup complicates training and inference and risks the two parts not talking well to each other. Architecturally, dense models share the same parameters for everything, which can make language and vision compete for capacity. Data-wise, people noticed that adding image captions sometimes hurts language perplexity, but no one was sure if vision itself was to blame or if it was the kind of text in captions.
Several attempts struggled. Dual-encoder designs added complexity and overhead, often underperforming when compared to a single high-quality semantic encoder. Dense (non-sparse) backbones forced a one-size-fits-all capacity allocation across modalities, making it hard to scale both well at the same compute. And when people saw language performance dip with image–text data, they sometimes concluded that vision and language couldn't happily co-train, rather than suspecting text distribution shifts (caption style vs. web text style).
What was missing was a clean, controlled study training from scratch that varied one factor at a time: the visual representation (VAE vs. semantic latents vs. pixels), the data mixture (text, video, image–text pairs, action-conditioned video), the architecture (shared vs. modality-specific FFNs vs. Mixture-of-Experts), and the scaling behavior for each modality. By doing this, we could isolate root causes: Is vision really fighting language? Do we truly need two visual pathways? How should we split capacity? What kinds of data produce synergy?
Why this matters to everyday life: Better multimodal models can describe photos more accurately, create images that match the prompt's meaning (not just style), and understand videos to help with safety, sports analysis, education, and accessibility. World modeling (predicting what you'll see after an action) matters for robots, AR/VR, and navigation apps. If a single model can read, see, and predict the future state of the world, it becomes a more helpful assistant: it can plan routes from camera input, explain what it's seeing in natural language, and even visualize outcomes before they happen. This paper shows we can get there more simply than we thought: use one semantic visual representation, train on diverse data, and allocate capacity smartly with MoE.
02 Core Idea
Aha! Train one unified model from scratch that treats language and vision as equal citizens: predict the next text token for language and perform diffusion-style denoising for visual latents, using a single semantic visual encoder (RAE-style) plus Mixture-of-Experts to dynamically route capacity to the right specialist.
- Analogy 1 (Orchestra): Imagine an orchestra with many instruments (text, images, video). Instead of having two conductors (one for understanding, one for generation), we hire one wise conductor (a single encoder) and seat expert players in sections (MoE experts) who step in when their parts are needed.
- Analogy 2 (Kitchen): In a kitchen, one high-quality knife (semantic encoder) can slice many foods well; you don't need a separate knife for each ingredient. A smart head chef (MoE router) chooses which sous-chef (expert) prepares each item.
- Analogy 3 (Sports Team): One team plays both offense (generation) and defense (understanding). Coaches (MoE) sub in the right specialists for each play, but the team still follows one playbook (unified transformer).
Before vs. after: Before, people often bolted vision onto a language model, sometimes with two visual tokenizers (one for understanding, one for generation), and saw mixed results and capacity conflicts. After, we see that one RAE-style semantic encoder is enough for both understanding and generation, diverse multimodal data brings synergy (especially video for language and image–text for vision), and MoE makes scaling efficient and balanced.
Why it works: High-dimensional semantic latents (RAE-style) carry rich meaning that diffusion can denoise well, so the same representation supports both generation and understanding. Diverse data exposes the model to how words, images, and motion relate, making aligned skills emerge. MoE lets tokens pick specialists so language and vision don't fight over the same parameters. Treating actions as plain text plugs planning into the same pipeline, so world modeling appears without extra modules.
Building blocks:
- Transfusion-style backbone that mixes two objectives: next-token prediction for text and flow-matching diffusion for vision.
- One visual encoder (e.g., SigLIP 2) decoded via RAE to and from pixels.
- Hybrid attention masking so each image frame has full spatial context while the sequence remains causal.
- Data mixtures: text, raw video, image–text pairs, and action-conditioned video.
- MoE with fine-grained experts, per-modality shared experts, and routing that learns specialization.
- Scaling laws that reveal vision is more data-hungry; MoE narrows the gap by giving language more flexible capacity without extra compute.
Concept sandwiches (in learning order):
Mixed-Modal Training (Hook): You know how learning both to read music and play piano makes you a better musician than just doing one? What it is: Training a model on words and visuals together so it learns from both at once. How: (1) Feed text, images, and videos; (2) Process them in one transformer; (3) Train with language and vision objectives; (4) Let each modality help the other. Why it matters: If you only read or only look, you miss connections between language and the visual world. Anchor: The model reads "a red bird on a branch" and also sees bird pictures and videos, learning better what "red bird" truly looks like.
Transfusion Framework (Hook): Imagine a relay race where one runner handles words and another handles pictures, but they pass the baton smoothly. What it is: A setup that combines next-token prediction for text with diffusion-style denoising for vision inside one transformer. How: (1) Tokenize text; (2) Encode images/video into latents; (3) Apply language loss on text tokens; (4) Apply flow-matching loss on visual latents; (5) Train jointly. Why it matters: It unifies reading and seeing without two separate models. Anchor: The model can caption an image and also generate an image from a caption using the same brain.
Next-Token Prediction (Hook): Like guessing the next word when your friend pauses mid-sentence. What it is: The model predicts the next text token from previous ones. How: (1) Look at earlier words; (2) Assign probabilities; (3) Learn to increase the probability of the right word. Why it matters: This is how language ability is built. Anchor: From "Paris is the capital of…", it predicts "France."
Flow Matching (Diffusion) (Hook): Think of cleaning a blurry photo layer by layer until it's sharp. What it is: A way to train the model to denoise visual latents from noisy to clean. How: (1) Mix clean latent with noise; (2) Predict the "velocity" to move toward clean; (3) Repeat across timesteps; (4) Decode to pixels. Why it matters: This powers high-quality image/video generation and learning visual structure. Anchor: Given "a yellow flower in sunlight," the model denoises until a crisp sunflower appears.
Representation Autoencoder (RAE) (Hook): Like storing a detailed sketch instead of a tiny thumbnail. What it is: Using high-dimensional semantic latents that are good for both understanding and generation. How: (1) Encode an image into rich latents (e.g., SigLIP 2); (2) Train diffusion in that space; (3) Decode back to pixels. Why it matters: One representation does both jobs well: simpler and stronger. Anchor: The same latents help answer "how many dogs?" and also draw those dogs.
Mixture-of-Experts (MoE) (Hook): Like a hospital where the triage nurse routes you to the right specialist. What it is: Many small expert networks; a router picks a few per token. How: (1) Split FFNs into experts; (2) Router maps tokens to experts; (3) Train with load-balancing; (4) Grow experts without raising per-token compute. Why it matters: Language and vision get the capacity they need without stepping on each other. Anchor: Text tokens often visit language experts; image tokens visit vision experts; deep layers blend them.
Scaling Laws (Hook): If you bake more cookies, how much more flour and sugar do you need? What it is: Rules telling how to size models and datasets as compute grows. How: (1) Fix compute; (2) Sweep model size and data; (3) Fit curves to find optimal trades; (4) Compare vision vs. language exponents. Why it matters: Guides efficient growth; vision needs more data than language. Anchor: With more compute, you add more visual tokens than text tokens to stay optimal.
World Modeling (Hook): Imagine a video game that can predict what the screen will look like after you press "WASD." What it is: Predicting the next visual state given the current view and an action. How: (1) Feed context frames + action (as text); (2) Predict the next frame latents; (3) Plan by simulating multiple action sequences; (4) Pick the best. Why it matters: This enables navigation, robotics, and planning without special modules. Anchor: From four hallway photos and the command "turn left," the model imagines the new view around the corner.
03 Methodology
High-level pipeline: Input (text, images, videos, or actions-as-text) → Tokenize/encode → One transformer with hybrid attention → Two objectives (language next-token and visual diffusion/flow matching) → Output: generated text or denoised visual latents decoded to pixels.
- Preparing inputs
- Text: Tokenized with a standard BPE. We use causal (left-to-right) attention for next-token prediction.
- Visuals: Each image or video frame is encoded into rich semantic latents (RAE-style; e.g., SigLIP 2). Frames are marked with special tokens so the model knows frame boundaries.
- Actions as text: Navigation actions like dx, dy, dyaw, re are written as plain text strings ("action: dx=+1.33, …"), so the same tokenizer handles them.
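As a small illustration of this serialization, a navigation action can be rendered as an ordinary string before tokenization. This is a hypothetical sketch: the field names, ordering, and precision are assumptions, not the paper's exact format.

```python
def action_to_text(dx, dy, dyaw):
    """Serialize a navigation action as plain text so the standard BPE
    tokenizer can handle it like any other string (illustrative format)."""
    return f"action: dx={dx:+.2f}, dy={dy:+.2f}, dyaw={dyaw:+.2f}"

s = action_to_text(1.33, -0.25, 0.10)
# "action: dx=+1.33, dy=-0.25, dyaw=+0.10"
```

Because the action is just text, no new embedding tables or input heads are needed; planning reuses the same vocabulary as captions and prompts.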
- Hybrid attention masking
- What happens: Text tokens attend causally to previous tokens. Visual tokens inside the same frame attend to each other bidirectionally (full spatial context) but only causally to earlier frames or text. This lets the model "see the whole picture" within a frame while still marching left-to-right overall.
- Why it exists: Without full within-frame context, image denoising would be blurry; without causality, the model would cheat by peeking ahead in the sequence.
- Example: If the sequence is [Prompt text] → [Frame 1 latents] → [Frame 2 latents], then Frame 2 can see Frame 1 and the prompt, but not any future frames.
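A minimal sketch of this hybrid mask, assuming `True` means "may attend" (a simplification for illustration, not the paper's implementation):

```python
import numpy as np

def hybrid_mask(segments):
    """Build a hybrid attention mask for a sequence of segments.
    segments: list of ("text", length) or ("frame", length) tuples.
    Text tokens attend causally; tokens within one frame also attend to
    each other bidirectionally. mask[i, j] = True means i may attend to j.
    """
    n = sum(length for _, length in segments)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    start = 0
    for kind, length in segments:
        if kind == "frame":
            # full bidirectional attention inside this frame
            mask[start:start + length, start:start + length] = True
        start += length
    return mask

# two prompt tokens (0-1) followed by one three-token frame (2-4)
m = hybrid_mask([("text", 2), ("frame", 3)])
```

Here the first frame token (index 2) can see the last frame token (index 4), while text token 1 still cannot peek ahead at the frame.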
- Language objective: next-token prediction
- What happens: Predict the next tokenâs probability distribution, then learn to increase the probability of the true next token.
- Why: This trains strong reading, writing, and alignment to prompts.
- Formula: L_LM = -(1/T) Σ_t log p(x_t | x_<t), the average negative log-probability of each true next token. Numerical example (illustrative values): suppose the true next three tokens have predicted probabilities 0.5, 0.25, and 0.1. Then L_LM = -(ln 0.5 + ln 0.25 + ln 0.1) / 3 ≈ 1.46 nats per token.
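The language objective reduces to a few lines of arithmetic; the token probabilities below are illustrative, not from the paper:

```python
import math

def next_token_nll(probs):
    """Mean negative log-likelihood (in nats) of the correct next tokens,
    given the probability the model assigned to each of them."""
    return -sum(math.log(p) for p in probs) / len(probs)

loss = next_token_nll([0.5, 0.25, 0.1])  # about 1.46 nats per token
```

Training nudges each of these probabilities upward, which drives the loss toward zero.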
- Vision objective: flow matching (diffusion)
- What happens: For each visual frame's clean latents x1, sample a timestep t, mix with noise x0 to get x_t, then predict the velocity that moves the sample from noise toward the clean latent.
- Why: This teaches the model to denoise latents into clean images/videos.
- Formula: x_t = (1 - t)·x0 + t·x1, with velocity target v = x1 - x0 and loss ||v_pred - v||². Numerical example (illustrative scalars): let x1 = 2.0 (clean), x0 = -1.0 (noise), and t = 0.5. Then x_t = 0.5·(-1.0) + 0.5·2.0 = 0.5. If the model predicts v_pred = 2.5, the target is 3.0, and the loss is (2.5 - 3.0)² = 0.25.
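The flow-matching step can be sketched on a single scalar latent; the values are illustrative, not the paper's:

```python
def flow_matching_step(x0, x1, t, v_pred):
    """One linear-interpolation flow-matching step on a scalar latent.
    x0: noise sample, x1: clean latent, t: timestep in [0, 1],
    v_pred: the model's predicted velocity. Returns the noisy
    interpolant, the velocity target, and the squared-error loss."""
    x_t = (1 - t) * x0 + t * x1       # noisy interpolant
    v_target = x1 - x0                # constant velocity target
    loss = (v_pred - v_target) ** 2   # squared error on the velocity
    return x_t, v_target, loss

x_t, v, loss = flow_matching_step(x0=-1.0, x1=2.0, t=0.5, v_pred=2.5)
# x_t = 0.5, v = 3.0, loss = 0.25
```

In the real model the latents are high-dimensional SigLIP 2 features and the prediction comes from the transformer, but the arithmetic per dimension is this simple.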
- Joint training
- What happens: Mix batches from multiple sources (text-only, video-only, image–text pairs, action-conditioned sequences). Optimize a weighted sum of language and vision losses.
- Why: Balances training so neither modality dominates.
- Formula: L = L_LM + λ·L_diff, where λ balances the two objectives. Numerical example (illustrative values): if L_LM = 2.0, L_diff = 0.4, and λ = 5, then L = 2.0 + 5·0.4 = 4.0.
- Visual representation: one RAE-style encoder
- What happens: Use a single high-dimensional semantic encoder (e.g., SigLIP 2) for both understanding and generation; decode with an RAE decoder back to pixels.
- Why: This unifies representations: simpler training and better transfer between understanding and generation.
- Example: The same latent for a dog picture helps the model answer âwhat color is the collar?â and also generate a matching dog image from text.
- Capacity separation with Mixture-of-Experts (MoE)
- What happens: Replace a dense FFN with many experts. A learned router picks top-k experts per token. Include per-modality shared experts (one always available for text, one for vision) and a load-balancing term so all experts get used.
- Why: Language and vision need different amounts and kinds of capacity. MoE gives each token the specialists it needs without raising per-token compute.
- Granularity: each dense FFN of hidden size d_ffn is split into finer experts of hidden size d_expert, giving G = d_ffn / d_expert fine-grained experts per original FFN. Numerical example (illustrative sizes): if d_ffn = 8192 and d_expert = 1024, then G = 8.
- Example: Early layers often route text tokens to text experts; later layers activate more vision and multimodal experts where fusion happens.
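The routing idea in these bullets can be illustrated with a toy top-k MoE layer. This is a simplification: the load-balancing loss, per-modality shared experts, and all shapes and names here are assumptions, not the paper's implementation.

```python
import numpy as np

def moe_layer(tokens, experts, router_w, top_k=2):
    """Minimal top-k Mixture-of-Experts feed-forward layer (sketch).
    tokens: (n, d) activations; router_w: (d, n_experts) routing weights;
    experts: list of (d, d) expert weight matrices."""
    logits = tokens @ router_w                       # (n, n_experts)
    # softmax over experts for each token
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(probs[i])[-top_k:]          # chosen specialists
        gates = probs[i][top] / probs[i][top].sum()  # renormalized gates
        for e, g in zip(top, gates):
            out[i] += g * (tok @ experts[e])         # weighted expert output
    return out

rng = np.random.default_rng(0)
d, n_exp = 4, 8
tokens = rng.normal(size=(3, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
router_w = rng.normal(size=(d, n_exp))
y = moe_layer(tokens, experts, router_w)
```

Each token only touches `top_k` of the 8 experts, so per-token compute stays flat while total capacity grows with the expert count; that is the property that lets vision and language stop competing for the same parameters.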
- Inference
- Text: Standard autoregressive generation.
- Images/videos: Run the denoising sampler (e.g., 25 steps), decode latents to pixels, optionally with classifier-free guidance.
- Actions: For planning, text actions guide predictions of future frames; a search algorithm (e.g., CEM) tries multiple action sequences and picks the best.
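The planning loop can be sketched with a generic cross-entropy-method (CEM) search. The scorer below is a toy stand-in: in the real system it would roll the world model forward on the serialized text actions and measure distance to the goal view.

```python
import numpy as np

def cem_plan(score_fn, horizon=4, n_samples=64, n_elite=8, n_iters=3, seed=0):
    """Cross-entropy-method search over action sequences (sketch).
    score_fn maps a (horizon,) action sequence to a scalar reward;
    higher is better. Actions here are 1-D displacements that the
    unified model would receive serialized as text ("action: dx=...")."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        samples = rng.normal(mu, sigma, size=(n_samples, horizon))
        scores = np.array([score_fn(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]   # best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # refined mean action sequence

# Toy scorer: reward sequences whose displacements sum to 2.0.
plan = cem_plan(lambda s: -abs(s.sum() - 2.0))
```

Each iteration samples candidate action sequences, keeps the best-scoring ones, and refits the sampling distribution around them, so the search concentrates on promising plans without any gradient through the world model.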
Secret sauce
- One encoder (RAE-style semantics) for both understanding and generation avoids the complexity and mismatch of dual pathways.
- Diverse data (video, image–text, text, and action-conditioned sequences) yields synergy: visual data helps language-aligned generation and even VQA, while text improves text-to-image alignment.
- MoE discovers specialization: most experts are text-focused overall, but deeper layers show more vision and multimodal experts; remarkably, the same vision experts serve both understanding and generation, confirming true unification.
04 Experiments & Results
The tests (what and why)
- Language: Perplexity (PPL) on in-domain (DCLM) and out-of-distribution (Notes) text checks if multimodal training harms reading/writing skills.
- Vision generation: Diffusion loss (training signal), FID and DPGBench/GenEval (alignment and quality) check how well the model draws what was asked.
- Visual understanding: VQA accuracy across many benchmarks measures if the model connects text to what it sees.
- World modeling: Navigation metrics (ATE, RPE) measure how well the model predicts the next view after an action and plans sequences of actions.
- Knowledge-informed generation (WISE): Tests whether language world knowledge shows up in generated images.
Competitors
- Text-only and T2I-only (generation-only) baselines.
- Dense vs. MoE architectures.
- Dual encoders (semantic for understanding + VAE for generation) vs. one semantic encoder.
Scoreboard with context
- Visual representation: Semantic RAE-style latents (e.g., SigLIP 2) beat VAE latents on both understanding and generation (like getting an A in both reading pictures and drawing them), while VAEs do fine on reconstruction but lag on semantics. Raw pixels trail in generation but aren't far behind in understanding.
- Data mixtures: Adding video to text matches or slightly improves text PPL (no language penalty), showing vision itself isn't the culprit. Image–text pairs are essential for both understanding and generation. Recaptioned data boosts VQA; high-aesthetic data (SSTK) boosts generation; combining them gives the best across-the-board results.
- Synergy: Training 20B VQA tokens plus 80B general data (text, video, or imageâtext) beats a 100B VQA-only model, proving diverse pretraining is stronger than just scaling in-domain data.
- World modeling: The biggest gains come from adding pure video to NWM data; even text and image–text help. Performance saturates with as little as ~1% domain data; most of the skill transfers from general multimodal pretraining.
- Architecture: Modality-specific FFNs already help. MoE with fine granularity and per-modality shared experts beats dense and hand-crafted separation (MoT) at the same compute. Experts naturally specialize by modality; deeper layers show more vision/multimodal fusion. Vision experts do not split by diffusion timestep and are shared between understanding and generation (very high correlation of routing patterns).
- Scaling laws: Dense compute-optimal trends show language splitting extra compute roughly evenly between model size and data, while vision is more data-hungry, with a larger data exponent. MoE narrows the gap between the language and vision data exponents, making unified scaling more feasible.
Surprises
- One semantic encoder outperforms dual encoders on both semantics and generation, simplifying the system.
- Language helps vision: better text modeling improves text-to-image alignment, raising GenEval scores.
- Unlabeled video helps even text tasks slightly and strongly helps VQA and world modeling.
- The same vision experts handle both understanding and generation: true unification rather than two separate pipelines.
A simple scaling-law intuition formula and example: if optimal model size and data scale like N* ∝ C^a and D* ∝ C^b with a + b = 1, then doubling compute multiplies model size by 2^a and data by 2^b. Numerical example (illustrative exponents, not the paper's fitted values): with a = 0.6 and b = 0.4, doubling compute makes N about 2^0.6 ≈ 1.52 times larger and D about 2^0.4 ≈ 1.32 times larger; a more data-hungry vision modality with b = 0.6 would instead need about 2^0.6 ≈ 1.52 times more data.
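The doubling arithmetic is a one-liner; the exponents passed in below are illustrative, not the paper's fitted values:

```python
def compute_optimal_growth(a, b, factor=2.0):
    """Given compute-optimal scaling N* ∝ C^a and D* ∝ C^b, return the
    multipliers for model size and data when compute grows by `factor`.
    Exponents here are illustrative placeholders."""
    return factor ** a, factor ** b

n_mult, d_mult = compute_optimal_growth(a=0.6, b=0.4)
# doubling compute: model grows ~1.52x, data grows ~1.32x
```

A modality with a larger data exponent b (as the paper reports for vision) gets a larger data multiplier at every compute doubling, which is why visual tokens must accumulate faster than text tokens.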
05 Discussion & Limitations
Limitations
- Slight out-of-distribution language dip: With some imageâtext data, perplexity on OOD text (like a Notes corpus) degrades a bit, likely due to caption-style text distribution. Careful data curation or mixing ratios may reduce this.
- Reconstruction gap: Semantic encoders can lag VAEs on ultra-fine pixel reconstruction. Bridging this without losing semantic strength is an open problem.
- MoE systems engineering: Imbalanced expert loads and routing overhead can reduce hardware efficiency; smarter routing and infrastructure are needed.
- Data coverage: Interleaved mixed sequences (e.g., tightly woven text-image alternation) were not deeply explored and could unlock further gains.
Required resources
- Significant compute and large, diverse datasets: hundreds of billions of text tokens plus comparable multimodal tokens (video, image–text, action-conditioned).
- High-quality semantic encoders and decoders (RAE-style) and a scalable training stack supporting sparse MoE.
When not to use
- If you only need text tasks and have limited compute, a text-only LLM may be simpler and cheaper.
- If you require pixel-perfect reconstructions (e.g., medical imaging diagnostics with strict fidelity), a dedicated VAE pipeline might be preferable today.
- Extremely low-latency edge scenarios where sparse routing overhead outweighs benefits.
Open questions
- Can we design semantic encoders that close the last-mile reconstruction gap with VAEs?
- What's the best way to mix interleaved data (paragraph → image → paragraph) to benefit both modalities and reduce OOD text dips?
- How do reinforcement learning or multimodal RL further unlock planning and controllability using the same unified latents?
- Can we tokenize video more natively (beyond frame-wise) to capture long-range motion with fewer tokens?
- How can routing be made fairer and more efficient so experts specialize deeply without hardware bottlenecks?
06 Conclusion & Future Work
Three-sentence summary: The paper shows that you can train a single model from scratch to read and see by combining next-token prediction for text with diffusion-style denoising for vision, all inside one transformer. A single RAE-style semantic visual encoder is enough for both understanding and generation, diverse multimodal data brings synergy (including world modeling with actions-as-text), and Mixture-of-Experts scales capacity efficiently while naturally specializing by modality. Scaling laws reveal vision is more data-hungry than language, and MoE helps reconcile this asymmetry so unified models can grow gracefully.
Main achievement: A clear, evidence-backed recipe for unified multimodal pretraining: RAE semantic latents, diverse data (text, video, image–text, actions-as-text), and MoE routing, yielding strong language, vision understanding, generation, and emergent world modeling in one model.
Future directions: Improve semantic encoders to match VAE-level reconstruction while keeping strong semantics; scale unlabeled video even further; refine MoE routing for efficiency and fairness; explore interleaved data schedules; and add mild RL or agentic objectives to strengthen planning and controllability. On the science side, deepen scaling-law studies for video and action, and investigate how expert specialization evolves at trillion-token scales.
Why remember this: It simplifies multimodal AI (one representation, one backbone, learned capacity allocation) while unlocking practical abilities like knowledge-grounded image generation and language-guided navigation. As we move beyond language-only shadows, this work points to models that both read the world and predict how it will change, bringing AI closer to grounded understanding and purposeful action.
Practical Applications
- Assistive technology that can describe a scene, answer questions about it, and generate helpful visual guides for people with low vision.
- Robotics navigation where the robot predicts future views from camera input and language-like action commands.
- Education tools that read a paragraph, visualize the concept (e.g., the water cycle), and quiz students with visuals and text.
- Creative design assistants that align strongly to prompts, generating images that match exact textual intent and facts.
- Video analysis that summarizes games, detects important moments, and explains tactics using both text and frames.
- AR/VR guidance that previews what you'll see after a turn or action, aiding indoor navigation or assembly instructions.
- Content moderation and safety systems that understand both captions and visuals to flag risky or misaligned content.
- E-commerce assistants that answer visual questions (fit, color, style) and simulate try-ons or product placements.
- Scientific tools that pair papers' text with visualizations and generate hypothesis-driven figures from descriptions.
- Digital twins and simulation planning where actions are text-like controls and future states are visual rollouts.