
Unified Vision-Language Modeling via Concept Space Alignment

Intermediate
Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk (3/1/2026)
arXiv

Key Summary

  • The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the same meaning space already used for text, so all modalities “speak” the same language.
  • They align a strong vision encoder to Sonar with a simple mean-squared-error (MSE) loss in three stages: many images, then synthetic videos, then high-quality human video captions.
  • With the Sonar decoder, v-Sonar can describe videos zero-shot and beats prior models on detailed captioning benchmarks like PE-Video and Dream-1k.
  • The Large Concept Model (LCM), a diffusion-based language model trained only on text embeddings, can understand v-Sonar’s visual embeddings without extra training.
  • They extend LCM to v-LCM by instruction-tuning on multilingual, multi-task data (M3IT), using the same latent diffusion objective to predict the next embedding.
  • v-LCM matches or beats leading vision-language models on many captioning and QA tasks, and wins in 61 of 62 languages tested.
  • OmniSONAR (a newer Sonar) provides a stronger, less-collapsed space, improving v-Sonar’s retrieval and captioning; a contrastive loss helped retrieval but slightly hurt generation.
  • The approach shows that text, speech, images, and videos can be unified in one concept space, with generation happening directly in that space instead of over tokens.
  • This unification enables strong zero-shot transfer, better multilingual support, and simpler pipelines for multimodal apps.

Why This Research Matters

A single, shared concept space lets one system understand and describe what it sees, then speak about it in dozens of languages without retraining for each. That means video captioning for accessibility, learning, or search can work equally well in Burmese, Arabic, or Spanish. Customer support, education, and media apps can answer image or video questions directly in the user’s language. Developers can simplify pipelines by aligning new modalities post-hoc instead of rebuilding models from scratch. This boosts inclusion for communities with fewer resources and speeds up innovation in multimodal AI.

Detailed Explanation


01Background & Problem Definition

You know how it’s easiest to work together when everyone uses the same map? For years, AI had great separate maps: one for text, one for speech, and another for pictures or videos. But those maps didn’t always match, so it was hard for a system to move smoothly from what it saw in a video to what it should say in any language.

🍞 Hook: Imagine classmates speaking many different languages trying to build a LEGO city together. If everyone uses a different instruction sheet, the pieces won’t fit.

🥬 The concept: a shared embedding space is a place where different kinds of inputs—words, sounds, images, and videos—are turned into points that mean the same thing if they describe the same idea.

  • How it works:
    1. Take an input (a sentence, a picture, a video clip, or audio).
    2. Encode it into a vector (a list of numbers) that captures its meaning.
    3. Put all vectors in the same space so “a red ball is bouncing” is close by, no matter if it came from text or video.
  • Why it matters: Without one shared space, models can’t easily match what they see to what they say, especially across many languages. 🍞 Anchor: A video of two pandas playing and the caption “two pandas are tumbling on the grass” should land at nearly the same spot in this concept space.
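The three steps above can be sketched with toy vectors (the numbers are invented; real Sonar embeddings are 1024-dimensional and come from trained encoders):

```python
import numpy as np

# Toy illustration of a shared embedding space (not the real Sonar encoders):
# different modalities are encoded into the same space, and inputs with the
# same meaning land close together while different meanings land far apart.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend embeddings for the same concept ("a red ball is bouncing"),
# one from a text encoder and one from a video encoder:
text_red_ball  = np.array([0.9, 0.1, 0.4])
video_red_ball = np.array([0.85, 0.15, 0.38])   # aligned: nearly the same point
video_pandas   = np.array([-0.2, 0.9, 0.1])     # different concept, far away

# The matched pair is much closer than the mismatched pair
assert cosine(text_red_ball, video_red_ball) > cosine(text_red_ball, video_pandas)
```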

The world before: Text and speech got pretty far with language-agnostic embeddings like LASER and SONAR, which grouped sentences from hundreds of languages by meaning. That enabled things like mining parallel sentences and even decoding text from these embeddings. But vision was left out, or it was aligned using big contrastive models (like CLIP), mostly in English. That meant cross-lingual video captioning or QA often lagged, and visual models didn’t plug directly into language-agnostic generators.

🍞 Hook: You know how sometimes you translate a joke from one language and it no longer feels funny? That’s what happens when spaces don’t match well: meaning gets lost.

🥬 The concept: semantic representation is the distilled meaning of content independent of the exact words or pixels.

  • How it works:
    1. Strip away surface details (fonts, accents, lighting, phrasing).
    2. Keep relationships and concepts (who did what, where, when).
    3. Store it as a vector so similar meanings are close.
  • Why it matters: Without solid semantic representations, models get distracted by unimportant details and can’t generalize. 🍞 Anchor: “A kid kicks a blue ball” and a video clip showing that action land near each other.

The problem: How do we take a strong vision encoder and map its outputs into the Sonar text space so images/videos and text live together? And can we then use a language model that operates in Sonar’s space to reason about visuals without retraining it on videos?

🍞 Hook: Think of learning to write letters exactly on ruled lines. If your writing doesn’t sit on the right lines, your teacher can’t read it.

🥬 The concept: MSE loss (mean squared error) is a way to measure how far your prediction is from a target, encouraging you to land exactly on the same spot.

  • How it works:
    1. For each paired video and caption, encode the video with a vision encoder and the caption with Sonar.
    2. Compute the squared distance between the two vectors.
    3. Adjust the vision side so the distance gets smaller.
  • Why it matters: Without a precise alignment loss, the two modalities might be close-ish but not truly compatible for generation. 🍞 Anchor: If the video says “two pandas,” MSE nudges the video vector to sit where the “two pandas” text vector already lives.
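A minimal numpy sketch of this loop, assuming a single linear projector and toy dimensions (the paper's projector is a small network over Perception Encoder features, and Sonar embeddings are 1024-d):

```python
import numpy as np

# MSE alignment sketch: a "student" projector learns to map a pooled video
# feature onto the frozen Sonar embedding of its caption.
rng = np.random.default_rng(0)
d_vis, d_sonar = 6, 4

W = rng.normal(scale=0.01, size=(d_sonar, d_vis))  # student projector
vision_feat = rng.normal(size=d_vis)               # pooled video features
target = rng.normal(size=d_sonar)                  # frozen Sonar caption embedding

lr = 0.05
for _ in range(2000):
    err = W @ vision_feat - target                        # prediction error
    W -= lr * (2 / d_sonar) * np.outer(err, vision_feat)  # gradient of the MSE

final_mse = np.mean((W @ vision_feat - target) ** 2)
# final_mse is near zero: the projected video now sits on the caption's spot
```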

Failed attempts and gaps: Contrastive-only training (like CLIP) is great for retrieval but can shift vectors off the exact manifold that a decoder expects, hurting generation. Also, most vision-language models were English-first, making multilingual generation weaker. The missing piece was aligning vision into a universal, multilingual space where we can both retrieve and generate.

🍞 Hook: You know how drawing with a dotted outline helps you color inside the shape?

🥬 The concept: teacher–student training means a strong, frozen teacher sets targets, and a student learns to match them.

  • How it works:
    1. Freeze the Sonar text encoder (teacher) that knows many languages.
    2. Train a lightweight projector on the vision side (student) to match teacher outputs.
    3. Use easy steps first (images), then harder (videos), then highest quality (human-checked captions).
  • Why it matters: Without a steady teacher, the student could drift and never land in the exact space the decoder needs. 🍞 Anchor: The vision student learns to “speak Sonar” by copying the teacher’s answers for each image/video–caption pair.

Real stakes: A single, universal concept space lets you describe a video in Burmese or answer a question about an image in Tajik, even if you never had paired visual data in those languages. It also simplifies building apps: one encoder space, one decoder, many languages and modalities.

🍞 Hook: Picture a world atlas that works for road trips, hiking, and sailing—all in one.

🥬 The concept: a modality-agnostic embedding space treats text, speech, images, and videos equally.

  • How it works:
    1. Use encoders to map all inputs into a shared space designed for meaning, not format.
    2. Keep the space stable (freeze it) so decoders can reliably read from it.
    3. Align new modalities post-hoc so they “plug in” without retraining the whole system.
  • Why it matters: Without modality-agnostic spaces, every new task needs its own siloed model and data. 🍞 Anchor: One system captioning a video in Spanish, answering a diagram question in Hindi, and summarizing an audio clip in Swahili, all with the same decoder.

02Core Idea

The “aha!” in one sentence: Map a powerful vision encoder into the multilingual Sonar space with a simple, careful alignment, then let a text-trained diffusion language model operate on those same embeddings—so vision and language become one continuous concept stream.

🍞 Hook: Imagine sliding a puzzle piece (vision) into an existing puzzle (text) so perfectly that the big picture becomes clearer.

🥬 The concept: concept space alignment means teaching visual features to land exactly where text meanings already live.

  • How it works:
    1. Take pairs of videos/images and their captions.
    2. Encode captions with Sonar (frozen teacher), encode visuals with a vision encoder plus a small projector.
    3. Minimize MSE so visuals and captions with the same meaning share the same coordinates.
  • Why it matters: Without exact alignment, a decoder trained on Sonar won’t reliably turn visual embeddings into clean text. 🍞 Anchor: A clip of “a boy flying a red kite” and its caption both land in the same neighborhood; the Sonar decoder then easily says, in any language, what’s going on.

Three analogies:

  • Adaptor plug: v-Sonar is the travel adaptor that lets your camera (vision) fit into any country’s outlet (Sonar languages).
  • Sheet music: Sonar is the score; v-Sonar teaches instruments (images/videos) to play notes that match the score, so the orchestra (the decoder) harmonizes.
  • Subway map: All lines (modalities) share the same stations (concepts). v-Sonar makes sure the red line (video) stops at the same stations as the blue line (text).

Before vs. after:

  • Before: Vision-language models mostly talked in English and needed big joint training. Retrieval was good, but multilingual generation was hard.
  • After: A light post-hoc mapping lets visuals plug into a universal, multilingual concept space. A text-only diffusion LM (LCM) can reason about images/videos zero-shot. With instruction-tuning (v-LCM), performance becomes competitive or better across many tasks and languages.

🍞 Hook: You know how a chef tastes as they cook, gradually refining the dish?

🥬 The concept: latent diffusion language modeling is like refining a noisy guess of the next idea embedding until it’s just right.

  • How it works:
    1. Start with a clean target embedding (the next sentence/step), add a bit of noise.
    2. Train a denoiser to predict the clean embedding from the noisy one, given prior context.
    3. At inference, reverse the process: iteratively denoise to produce the next embedding, and decode to text.
  • Why it matters: Operating in embeddings keeps meaning smooth and language-agnostic; it avoids token-level quirks. 🍞 Anchor: Given the visual-text embeddings so far, LCM predicts the next embedding that decodes to “Two pandas roll down the hill,” in Burmese if you want.
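The noising and denoising steps can be sketched as follows (the noise schedule, shapes, and the oracle denoiser are assumptions; LCM's real denoiser is a transformer conditioned on the preceding context embeddings):

```python
import numpy as np

# Conceptual sketch of the latent-diffusion objective over embeddings.
rng = np.random.default_rng(1)
d = 8

clean_next = rng.normal(size=d)       # target: the next sentence embedding
t = 0.3                               # noise level in (0, 1)
noisy = np.sqrt(1 - t) * clean_next + np.sqrt(t) * rng.normal(size=d)

# Training: the denoiser's output is regressed toward the clean embedding
pred = noisy                          # an untrained denoiser's naive guess
loss = np.mean((pred - clean_next) ** 2)

# Inference: start from pure noise and iteratively refine toward the
# denoiser's estimate (here an oracle that already knows the target)
x = rng.normal(size=d)
for _ in range(10):
    est_clean = clean_next            # stand-in for the trained denoiser
    x = x + 0.5 * (est_clean - x)     # move part-way toward the estimate
# x now lands on the target embedding, which the Sonar decoder verbalizes
```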

Building blocks:

  • v-Sonar: a small projector with temporal attention that pools frame features, then maps to the Sonar space.
  • Teacher–student MSE alignment: freeze Sonar as teacher; train the projector (and lightly the encoder) to match.
  • Coarse-to-fine curriculum: millions of image–caption pairs → synthetic video → human-checked video captions.
  • Stable manifold: OmniSONAR provides a more spread-out, healthy space (higher trace and volume), easing alignment and decoding.
  • LCM/v-LCM: a diffusion LM that predicts next embeddings directly in this shared space; v-LCM is LCM tuned with vision-language instructions.

🍞 Hook: Think of drawing within clean lines. It’s easier to color beautifully when the outline is crisp.

🥬 The concept: OmniSONAR is a stronger, cleaner Sonar space, trained with more data and extra contrastive stages.

  • How it works:
    1. Learn language-agnostic embeddings with broad multilingual coverage.
    2. Use contrastive/self-distillation phases so meanings spread out (higher variance), reducing collapse.
    3. Keep the encoder–decoder consistent so generation stays faithful.
  • Why it matters: A healthier space makes alignment easier and decoding better. 🍞 Anchor: With OmniSONAR, v-Sonar achieves higher Recall@1 on PE-Video and stronger BLEU on Dream-1k.

Why it works (intuition): If visuals and text share the same coordinates, then any reasoning or generation system built for that space (like LCM) can use either as context, making zero-shot transfer natural. Diffusion over embeddings smooths out language-specific quirks, so multilingual decoding becomes a simple readout step rather than re-learning for each language.

03Methodology

At a high level: Visual input (image/video) → Perception Encoder features → v-Sonar projector (temporal + pooling) → Sonar-aligned embedding → (a) retrieval or Sonar decoding; (b) feed into LCM/v-LCM to predict next embeddings and generate text.

Step A: Encode frames with a strong vision backbone.

  • What happens: A state-of-the-art Perception Encoder processes each frame (e.g., 8 frames sampled from a video) into high-level features.
  • Why this step exists: You need rich, robust visual features before you can align them to language.
  • Example: A 10-second clip of a chef slicing an onion yields 8 frame embeddings capturing knife, onion, hands, and motion cues.

🍞 Hook: You know how keeping the order of comic panels matters for the story?

🥬 The concept: temporal attention and positional embeddings keep frame order and let frames talk to each other.

  • How it works:
    1. Add time-position encodings to each frame embedding.
    2. Run a temporal self-attention layer so frames share context (who moves where, when).
    3. Aggregate with attention pooling into a single video embedding.
  • Why it matters: Without temporal modeling, the video becomes a bag of frames; you’d miss actions and causality. 🍞 Anchor: Slicing happens before dicing; temporal attention keeps that sequence clear in the final embedding.
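A minimal numpy sketch of these three steps, with single-head attention and made-up dimensions (the actual architecture details are the paper's):

```python
import numpy as np

# Temporal self-attention + attention pooling over per-frame features.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_frames, d = 8, 16

frames = rng.normal(size=(n_frames, d))          # per-frame encoder features
pos = rng.normal(scale=0.1, size=(n_frames, d))  # time-position encodings
x = frames + pos                                 # keep frame-order information

# Step 2: self-attention lets frames exchange temporal context
attn = softmax(x @ x.T / np.sqrt(d))             # (n_frames, n_frames)
x = attn @ x

# Step 3: attention pooling with a learned query summarizes the whole clip
query = rng.normal(size=d)
weights = softmax(x @ query / np.sqrt(d))        # one weight per frame
video_embedding = weights @ x                    # single video-level vector

assert video_embedding.shape == (d,)
```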

Step B: Project into Sonar’s space with MSE alignment.

  • What happens: A lightweight projector maps the pooled visual embedding to 1024-d Sonar space, trained to minimize MSE to the paired caption’s Sonar embedding.
  • Why this step exists: Sonar decoder and LCM expect embeddings on the Sonar manifold; exact matching enables faithful generation.
  • Example: For the onion video, the projector is trained so its output matches the Sonar vector of the caption “A person slices an onion on a cutting board.”

🍞 Hook: First ride the baby slope, then the blue, then the black diamond.

🥬 The concept: a coarse-to-fine curriculum eases learning from simple to complex.

  • How it works:
    1. Stage 1 (coarse grounding): 12M image–caption pairs teach broad visual-text mapping.
    2. Stage 2 (temporal adaptation): 2M synthetic video–caption pairs add motion understanding.
    3. Stage 3 (fine alignment): 200K human-checked video captions refine precision.
  • Why it matters: Without this staircase, training can wobble or overfit early noise. 🍞 Anchor: After Stage 3, v-Sonar nails detailed actions like “a child hands the kite string to his sister.”

Step C: Stabilize training with good engineering.

  • What happens: Initialize projector with tiny Gaussian weights; warm up by training the projector alone; then joint training with asynchronous learning rates (higher for projector, lower for encoder). Add attention pooling and temporal attention.
  • Why this step exists: Random large updates can destabilize the pretrained encoder; careful schedules prevent drift.
  • Example: Compared to full fine-tuning from scratch, async LR and normalized init improved BLEU and alignment consistency in ablations.
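The asynchronous-learning-rate idea reduces to two plain SGD updates with different step sizes; the rates below are illustrative, not the paper's (in PyTorch this is typically expressed as one optimizer with two parameter groups):

```python
import numpy as np

# Async LR sketch: the freshly initialized projector takes larger steps
# than the pretrained encoder, which should only drift slowly.
rng = np.random.default_rng(0)

encoder_w   = rng.normal(size=4)               # pretrained weights
projector_w = rng.normal(scale=0.01, size=4)   # tiny Gaussian init

lr_encoder, lr_projector = 1e-5, 1e-3          # projector LR >> encoder LR
grad = np.ones(4)                              # stand-in gradient

enc_before, proj_before = encoder_w.copy(), projector_w.copy()
encoder_w   -= lr_encoder * grad               # gentle encoder update
projector_w -= lr_projector * grad             # faster projector update
```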

Secret sauce #1: Freeze the teacher, shape the student.

  • Freezing Sonar locks a stable target. With MSE, the projector learns to land exactly where the decoder and LCM already operate, enabling immediate zero-shot generation and reasoning.

Secret sauce #2: Healthy manifold via OmniSONAR.

  • OmniSONAR’s higher trace and logdet (spread and volume) mean less collapsed neighborhoods. Alignment lands in a space the decoder can navigate cleanly, improving retrieval and captioning.
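The "spread and volume" diagnostics can be sketched as the trace and log-determinant of the embedding covariance; the synthetic data below only illustrates what collapse does to both numbers:

```python
import numpy as np

# Manifold-health sketch: a collapsed space has near-zero variance along
# many directions, shrinking both the trace (total spread) and the
# log-determinant (log volume) of the covariance.
rng = np.random.default_rng(0)

def spread_stats(embs):
    cov = np.cov(embs, rowvar=False)           # feature covariance
    sign, logdet = np.linalg.slogdet(cov)
    return np.trace(cov), logdet

healthy   = rng.normal(size=(1000, 8))         # well-spread embeddings
collapsed = rng.normal(size=(1000, 8)) * 0.05
collapsed[:, :6] *= 0.01                       # most dimensions collapsed

trace_h, vol_h = spread_stats(healthy)
trace_c, vol_c = spread_stats(collapsed)
assert trace_h > trace_c and vol_h > vol_c     # healthier space: bigger both
```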

Secret sauce #3: Operate in latent space end-to-end.

  • Because visuals and text live together, LCM can read visual embeddings as if they were textual context. v-LCM instruction-tunes this behavior across tasks and languages without changing the fundamental generation recipe.

🍞 Hook: Like whispering an idea along a chain until a full sentence forms.

🥬 The concept: LCM’s diffusion next-embedding prediction treats ideas as smooth points and denoises toward the right next point.

  • How it works:
    1. Add noise to the true next embedding.
    2. Train a denoiser to reconstruct the clean embedding, given prior context embeddings.
    3. At inference, iteratively denoise to sample the next embedding; decode to text with the Sonar decoder.
  • Why it matters: This keeps generation language-agnostic and meaning-centered. 🍞 Anchor: Given visual context of pandas plus an instruction in English, LCM/v-LCM outputs an embedding that decodes to a correct Burmese summary.

Putting it together for tasks:

  • Retrieval: Compare cosine similarity between a text query (Sonar) and video embeddings (v-Sonar). Higher alignment consistency boosts Recall@1.
  • Captioning: Feed the video embedding to the Sonar decoder to verbalize directly, or let LCM/v-LCM generate the next embeddings for richer, instruction-following outputs.
  • Multilingual QA: Encode instruction and question with Sonar, visual with v-Sonar, concatenate, and let v-LCM generate answers in the asked language.
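The retrieval path can be sketched end-to-end with stand-in embeddings (real ones come from Sonar and v-Sonar; here each text query is just a noisy copy of its matching video, which is what good alignment aims for):

```python
import numpy as np

# Toy text-to-video retrieval: rank videos by cosine similarity to the
# query and check Recall@1 (fraction of queries whose top hit is correct).
rng = np.random.default_rng(0)
n, d = 5, 32

video_embs = rng.normal(size=(n, d))                        # "v-Sonar" vectors
text_queries = video_embs + 0.1 * rng.normal(size=(n, d))   # aligned captions

def normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

sims = normalize(text_queries) @ normalize(video_embs).T    # cosine matrix
top1 = sims.argmax(axis=1)
recall_at_1 = float(np.mean(top1 == np.arange(n)))
# recall_at_1 is 1.0 on this easy toy set
```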

Example walkthrough (Dream-1k):

  1. Encode video frames with Perception Encoder → temporal attention → pooled.
  2. Project to Sonar space (v-Sonar embedding).
  3. For zero-shot captioning, decode with Sonar decoder → a detailed, multilingual caption.
  4. For instruction-tuned QA, feed embeddings to v-LCM → predict next embeddings → decode to an answer.

04Experiments & Results

What they tested and why:

  • Text-to-video retrieval: Can v-Sonar put videos and captions so close that the right video is found at the top? This stresses cross-modal alignment quality.
  • Video captioning: Can the Sonar decoder (or LCM) verbalize v-Sonar’s embeddings into accurate, detailed, multilingual captions? This checks manifold compatibility for generation.
  • Zero-shot LCM: Can a text-only diffusion model reason over visual embeddings without any video training? This demonstrates true unification.
  • v-LCM (instruction-tuned): Does tuning LCM on multimodal tasks in many languages bring it to or beyond state-of-the-art?

Competitors:

  • SigLIP2 and Perception Encoder baselines (vision encoders) for retrieval.
  • InternVL, Qwen-VL, and Perception LM (1B–3B scales) for captioning and QA.

Scoreboard with context:

  • Retrieval (PE-Video): v-Sonar Recall@1 = 73.03 versus 63.91 for both SigLIP2-G-OPT and the base Perception Encoder. That’s like getting an A when peers get B’s—a clear margin for top-1 hits.
  • Retrieval (Vatex): v-Sonar Recall@1 = 40.75 vs. SigLIP2’s 27.52 and PECoreG’s 18.90—huge gains in a tough multilingual set.
  • Captioning (PE-Video): v-Sonar + OmniSONAR decoder BLEU 39.0 vs. best prior around 30.0 (Qwen2.5-VL-3B). That’s a jump from a solid B to an A- in clarity and detail.
  • Captioning (Dream-1k): v-Sonar gets BLEU 23.9 vs. prior 19.6—again a notable boost on fine-grained, descriptive captions.
  • Captioning (Vatex-en): v-Sonar slightly behind InternVL2.5 on short single-sentence captions, but still competitive with Qwen/PLM; on Vatex-zh, v-Sonar outperforms the InternVL series, highlighting multilingual strengths.

Surprising findings:

  • Contrastive loss improved retrieval but hurt captioning quality slightly, likely by shifting embeddings off the Sonar decoder’s expected manifold (different norms/local covariance). This explains why MSE-only alignment was chosen for generation-centric use.
  • LCM, trained only on text, showed competitive zero-shot video captioning and could summarize long videos when fed v-Sonar embeddings—evidence that it truly reasons in the shared concept space.
  • Using v-Sonar embeddings directly beat feeding LCM captions re-encoded with Sonar, especially for longer videos. This shows v-Sonar preserves richer visual details than text-only summaries.

v-LCM on M3IT (multilingual, multimodal):

  • Tasks across image captioning (COCO), visual QA (VQA-V2, OKVQA, VisualMRC), video captioning (MSRVTT), and video QA (MSRVTT-QA, ActivityNetQA, IVQA).
  • v-LCM hit state-of-the-art or competitive numbers, especially strong on video QA (e.g., high Rouge-L on IVQA and ActivityNetQA), while remaining close to top models on image/document tasks.
  • Multilinguality: v-LCM outperformed strong models in 61/62 languages. Gains were modest in very high-resource languages but substantial in mid/low-resource settings (e.g., Burmese, Tajik, Telugu). It could produce meaningful outputs where some baselines failed entirely (e.g., Urdu, Modern Standard Arabic, Tamil).

Ablations that mattered:

  • Architecture: Linear projection with a frozen encoder beat naïve full fine-tuning at first; then asynchronous learning rates, careful initialization, attention pooling, and temporal attention successively improved metrics.
  • Data curriculum: Both image caption pre-stage and synthetic video mid-stage each added small but consistent gains, confirming the value of progressive adaptation.
  • Sonar vs. OmniSONAR: OmniSONAR had higher trace/logdet (more spread), giving better captioning/retrieval and higher oracle scores. Sonar1 remained competitive where LCM compatibility was required but was harder to align due to partial collapse.

Bottom line: v-Sonar achieved best-in-class zero-shot retrieval on multiple video sets and raised video captioning quality, especially for detailed descriptions. LCM could reason visually zero-shot, and v-LCM pushed multilingual multimodal generation to the front across dozens of languages.

05Discussion & Limitations

Limitations:

  • Dependence on caption data: The alignment leans on image/video–caption pairs. If captions are sparse, noisy, or culturally biased, the mapping may inherit those issues.
  • Short vs. detailed captions: v-Sonar is strongest on detailed descriptions; it can trail on very short, single-sentence benchmarks like VATEX-en.
  • Manifold sensitivity: The Sonar decoder expects embeddings with certain norms/covariance; contrastive training that improves retrieval can degrade generation, so there’s a trade-off to manage.
  • Document/image reasoning: v-LCM lags top models on some document-heavy or specialized reasoning tasks (e.g., VisualMRC), suggesting room for better spatial/layout grounding or task-specific tuning.

Required resources:

  • Training v-Sonar used large-scale GPU clusters (e.g., 64×A100 for alignment stages) and sizeable datasets (12M images, 2M synthetic videos, 200K human-checked videos).
  • v-LCM training used 8×A100 with instruction-tuning on M3IT; still non-trivial compute and curation effort.

When not to use:

  • If your task needs pixel-precise outputs (fine-grained segmentation/matching) rather than semantic text, a token-level or pixel-space model may fit better.
  • If you must optimize top-1 retrieval only and never generate text, a pure contrastive model (e.g., SigLIP2) could be simpler, though v-Sonar is strong too.
  • Extremely domain-specific jargon or diagrams not reflected in caption data may need domain adaptation.

Open questions:

  • Can we jointly optimize for retrieval and generation without manifold drift (e.g., constraint-aware contrastive objectives)?
  • How far can we push spatial grounding and layout reasoning with only semantic-level alignment, or should we add light bounding-box or region-level cues?
  • Could we extend the same alignment recipe to other modalities (e.g., 3D point clouds, medical imaging) while retaining multilingual strengths?
  • What are the best ways to detect and correct bias in captions so alignment doesn’t encode it?
  • Can diffusion over embeddings be accelerated further for real-time streaming video captioning and QA?

06Conclusion & Future Work

Three-sentence summary: This paper maps a powerful vision encoder into the multilingual Sonar concept space (v-Sonar), enabling images and videos to share the same coordinates as text. Because a diffusion language model (LCM) already operates in this space, it can immediately reason over visual inputs, and with instruction-tuning (v-LCM), match or beat leading vision-language models across tasks—especially in 61 of 62 languages tested. The core result is a simple, stable, and universal recipe for unifying modalities and languages in one continuous meaning space.

Main achievement: Demonstrating that post-hoc MSE-based alignment plus a healthy multilingual manifold (OmniSONAR) lets us do both strong retrieval and high-fidelity multilingual generation directly from visual embeddings, and that a text-only diffusion LM can zero-shot process vision in this unified space.

Future directions:

  • Develop training objectives that jointly preserve manifold faithfulness for generation while boosting retrieval.
  • Add lightweight spatial grounding (regions, boxes) without sacrificing simplicity.
  • Extend alignment to more modalities (e.g., audio-visual-events, 3D) and more low-resource languages.
  • Speed up diffusion inference for real-time applications.

Why remember this: It shows a practical, scalable path to one concept space for text, speech, images, and videos, enabling zero-shot multimodal reasoning and broad multilingual support with a single decoder—simplifying systems while improving reach and performance.

Practical Applications

  • Multilingual video captioning for news, education, or accessibility across 60+ languages.
  • Cross-language video search where a Thai query finds the right English-captioned video via shared embeddings.
  • Visual question answering for help desks: users upload a photo/video and get answers in their language.
  • Content moderation and safety checks that describe and reason about videos regardless of language.
  • Sports highlights summarization that generates match recaps in the viewer’s preferred language.
  • eCommerce visual assistance: describe product videos and answer questions in low-resource languages.
  • Education platforms that explain science experiment videos in students’ native languages.
  • Media asset management: unified embedding indexing for images/videos to streamline retrieval and tagging.
  • Assistive technology for the visually impaired: real-time scene descriptions in local languages.
  • Enterprise knowledge tools that summarize training videos and answer compliance questions globally.
#v-Sonar#OmniSONAR#concept space alignment#multimodal embeddings#vision-language modeling#latent diffusion#LCM#multilingual generation#video captioning#text-to-video retrieval#teacher-student training#MSE loss#temporal attention#instruction tuning