VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Key Summary
- •VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.
- •It turns the problem of reshaping importance weights into a principled "find a good helper distribution" task and solves it with a clean formula.
- •Instead of clipping each token or normalizing by length, VESPO works on whole sequences directly, which keeps the story of the response intact.
- •Its soft kernel, roughly phi(W) = W^c1 * exp(c2 * (1 - W)), keeps on-policy samples strong and gently shrinks extreme weights.
- •This design reduces the huge variance that usually explodes for long responses, without introducing length bias.
- •On math reasoning benchmarks, VESPO is stable up to 64× staleness and in fully asynchronous training, beating or matching strong baselines.
- •It shines especially on Mixture-of-Experts models, where routing mismatches usually make training wobbly.
- •VESPO also plays nicely with engineering fixes like router replay (R2), giving even better performance when combined.
- •The key idea—"weight reshaping equals choosing a proposal distribution"—gives a unifying, theory-backed view of many past tricks.
- •Overall, VESPO makes off-policy LLM RL more reliable, scalable, and easier to run in real systems.
Why This Research Matters
Real-world RL for LLMs often runs off-policy, in big batches, and asynchronously, so stability under staleness and mismatches is essential. VESPO’s smooth, sequence-level kernel avoids length bias while taming extreme importance weights, making updates reliable even for very long answers. This lets teams scale training without babysitting collapses, reducing wasted compute and engineering overhead. Its measure-change perspective unifies many past tricks and offers a principled foundation for future methods. It is particularly helpful for Mixture-of-Experts models, where routing mismatches commonly break training. Because it also complements engineering fixes like router replay, practitioners can combine approaches for even stronger, safer training.
Detailed Explanation
01 Background & Problem Definition
You know how coaches try different drills to help a team play better, but the team might be practicing last week’s plays while the coach is already teaching new ones? That mismatch can make practice confusing and lead to bad habits. Training large language models (LLMs) with reinforcement learning (RL) has a similar issue: the data used to update the model often comes from an older policy, not the latest one, and that can destabilize learning.
🍞 Hook: Imagine teaching a robot to solve math problems step by step, but the examples it studies were written by yesterday’s version of the robot. 🥬 The Concept (Reinforcement Learning): RL is a way to train models by giving them rewards for good behavior. How it works: (1) the model (policy) tries something; (2) it gets a reward; (3) it updates itself to get more reward next time. Why it matters: without RL, models don’t naturally improve at multi-step tasks like reasoning. 🍞 Anchor: When an LLM gets a math answer right, it receives a positive reward, nudging it to use similar steps in future problems.
🍞 Hook: You know how you might ask more questions to the students who probably have useful answers? 🥬 The Concept (Importance Sampling): It’s a way to reuse data collected under one policy to estimate what would happen under another. How it works: (1) compute an importance weight W = (current policy probability) / (old policy probability) for each whole response; (2) use W to reweight the gradient; (3) update the model. Why it matters: without importance sampling, off-policy updates would be biased and lead to wrong learning signals. 🍞 Anchor: If a response was likely under yesterday’s policy but not today’s, we discount it; if it was more likely under today’s policy, we emphasize it.
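As a minimal sketch of step (1) above (the function name is ours, purely illustrative), the sequence-level weight W can be computed from per-token log-probabilities, summing in log-space so long responses neither underflow nor overflow:

```python
import math

def sequence_importance_weight(logp_current, logp_behavior):
    # Sum per-token log-ratios, then exponentiate once:
    # W = exp(sum_t [log pi(y_t) - log mu(y_t)])
    log_w = sum(p - m for p, m in zip(logp_current, logp_behavior))
    return math.exp(log_w)

# A 3-token response the current policy finds slightly more likely than
# the behavior policy did: W > 1, so the sample is emphasized.
w = sequence_importance_weight([-1.0, -0.5, -2.0], [-1.2, -0.7, -2.3])
```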
🍞 Hook: Imagine using an old game guide on a new level—it’s out of date. 🥬 The Concept (Policy Staleness): It’s when the data was generated by an older (stale) policy, not the current one. How it works: (1) roll out many samples; (2) split into mini-batches; (3) by the time you update later mini-batches, the policy has changed, so their data is stale. Why it matters: stale data can push training in the wrong direction, even causing collapse. 🍞 Anchor: In asynchronous systems, rollouts can lag multiple updates behind, making this effect bigger.
🍞 Hook: If your scores swing wildly from game to game, it’s hard to tell if you’re improving. 🥬 The Concept (Variance Reduction): It’s about making estimates steadier so updates are reliable. How it works: (1) identify where randomness explodes (long sequences make W huge or tiny); (2) smooth or reshape the weights; (3) keep signal but reduce noise. Why it matters: without variance reduction, sequence-level importance sampling can blow up and crash training. 🍞 Anchor: Averaging over many games (or shrinking outliers) makes it clearer whether a new strategy works.
Before VESPO, people tried two main families of fixes, both with downsides.
🍞 Hook: Imagine editing every word in a story to fit a rule, even though the whole story’s meaning is what matters. 🥬 The Concept (Token-Level Clipping): It limits the importance ratio for each token separately. How it works: (1) compute ratio per token; (2) clip it if it’s too big/small; (3) treat each token’s gradient independently. Why it matters: it avoids the product of many ratios exploding, but it breaks the whole-sequence structure and is only a first-order approximation. 🍞 Anchor: Clipping tokens keeps updates small but can miss how earlier steps change the meaning of later steps.
🍞 Hook: Turning down the volume so every song sounds equally loud can hide the special parts of a track. 🥬 The Concept (Sequence-Level Normalization): It scales the whole-sequence weight by length (like geometric mean) to tame variance. How it works: (1) take per-token log-ratios; (2) average them; (3) exponentiate to get a length-normalized weight; (4) often still clip. Why it matters: it reduces variance, but introduces length-dependent bias that favors long outputs and can distort learning. 🍞 Anchor: Two sequences with the same average per-token score but very different lengths may get the same weight, even though their true importance should differ.
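A tiny numeric sketch of the length bias the anchor describes (all numbers made up for illustration): two responses with the same per-token log-ratio but very different lengths get identical length-normalized weights, while their raw sequence weights differ exponentially.

```python
import math

def raw_weight(log_ratios):
    # Product of per-token ratios = exp(sum of log-ratios).
    return math.exp(sum(log_ratios))

def length_normalized_weight(log_ratios):
    # Geometric mean of per-token ratios = exp(average log-ratio).
    return math.exp(sum(log_ratios) / len(log_ratios))

short = [0.2] * 5    # 5-token response, each token ~1.22x more likely now
long = [0.2] * 50    # 50-token response with the same per-token shift

# raw_weight(short) ~ 2.7 vs raw_weight(long) ~ 22026: raw weights explode
# with length, while both normalized weights equal exp(0.2) ~ 1.22,
# erasing the genuine difference in sequence-level importance.
```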
The big pain point was a lack of a single, principled recipe for how to reshape these importance weights. Heuristics worked sometimes, but failed under heavy staleness, in asynchronous setups, or in Mixture-of-Experts (MoE) models where tiny numerical differences cause very different routing through experts. Real systems need something that stays stable when rollouts lag, batching is large, and training/inference engines don’t match perfectly.
02 Core Idea
🍞 Hook: You know how choosing the right measuring cup makes baking easier and more consistent? 🥬 The Concept (Proposal Distribution / Measure Change): Any way you reshape importance weights is secretly the same as choosing a new distribution Q to measure with. How it works: (1) start from behavior data μ; (2) pick a transformation φ(W); (3) this defines a new proposal Q ∝ μ · φ(W); (4) your gradient becomes an on-policy gradient under Q (up to a constant). Why it matters: instead of guessing φ(W), we can design Q with clear goals—stay near μ for sample efficiency, lean toward π (current policy) for better learning, and limit variance. 🍞 Anchor: It’s like switching to a better ruler so your measurements are both trustworthy and convenient.
The Aha! in one sentence: If weight reshaping equals choosing Q, then we can design Q by solving a clean optimization problem and read off the best φ(W) in closed form.
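The identity behind the Aha! can be checked numerically on a toy discrete distribution (every number below is made up for illustration): reweighting by any φ(W) gives exactly an on-policy expectation under Q ∝ μ · φ(W), scaled by the constant Z = E_μ[φ(W)].

```python
mu = [0.5, 0.3, 0.2]   # behavior distribution over 3 outcomes
pi = [0.2, 0.3, 0.5]   # current-policy distribution

def phi(w):
    return w ** 0.5    # any reshaping of the weight W = pi/mu

w = [p / m for p, m in zip(pi, mu)]
q_unnorm = [m * phi(wi) for m, wi in zip(mu, w)]
Z = sum(q_unnorm)                   # normalizer: E_mu[phi(W)]
q = [qu / Z for qu in q_unnorm]     # the implied proposal distribution Q

g = [1.0, -2.0, 3.0]                # any per-sample quantity (e.g. a gradient term)
reshaped = sum(m * phi(wi) * gi for m, wi, gi in zip(mu, w, g))
on_policy_q = Z * sum(qi * gi for qi, gi in zip(q, g))
# reshaped == on_policy_q: the reshaped off-policy estimate is exactly an
# on-policy estimate under Q, up to the constant Z.
```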
Three analogies for the same idea:
- The librarian: You have two shelves, old favorites (μ) and new picks (π). You choose a reading list Q that mixes both shelves while avoiding any single book dominating the list (variance control).
- The thermostat: You blend mom’s preference (μ) and dad’s preference (π), while a safety guard prevents the temperature from swinging wildly (variance bound).
- The traffic officer: Cars (samples) from different roads (μ vs π) must merge smoothly; no car should speed so fast it endangers flow (softly reduce extreme weights).
Before vs After:
- Before: Heuristics like token clipping or length-normalized sequence weights; they can work but may bias updates or ignore cross-token effects.
- After: A principled variational design that balances closeness to μ and π and caps variance, yielding a smooth, sequence-level kernel with no length normalization.
Why it works (intuition without equations):
- Sequence weights multiply per-token ratios, so they can explode for long answers; we need a way to softly pull extreme weights back toward useful ranges.
- We pose: find Q that (1) stays near μ (for sample efficiency), (2) moves toward π (for better learning), and (3) keeps the expected weight under Q bounded (for variance control).
- Solving this gives a kernel φ(W) that grows like a power (helps moderate small W) but decays exponentially for large W (shrinks huge weights), and is smooth everywhere.
🍞 Hook: Think of using a gentle mold that shapes clay into a neat figure without sharp cuts. 🥬 The Concept (Closed-Form Reshaping Kernel): It’s the exact, math-derived function for reweighting sequence-level importance W. How it works: (1) set an objective that mixes closeness to μ and π; (2) add a variance constraint; (3) solve it to get φ(W) ≈ W^α · exp(−λ W); (4) shift parameters so φ(1)=1 and use different settings for positive vs negative advantages. Why it matters: now we have a smooth, theory-backed alternative to hard clipping and no need for length normalization. 🍞 Anchor: Near W=1, it keeps full strength; for huge W, it gently decays; for tiny W, it doesn’t over-punish.
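A short sketch of the kernel's shape (the c1, c2 values below are illustrative placeholders, not the paper's tuned settings):

```python
import math

def vespo_kernel(w, c1=0.5, c2=0.5):
    # phi(W) = W^c1 * exp(c2 * (1 - W)); phi(1) = 1 by construction.
    return w ** c1 * math.exp(c2 * (1.0 - w))

phi_on = vespo_kernel(1.0)     # exactly 1.0: on-policy samples keep full strength
phi_big = vespo_kernel(10.0)   # ~0.035: exponential decay tames huge weights
phi_small = vespo_kernel(0.1)  # ~0.50: tiny weights are tempered, not zeroed
```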
🍞 Hook: Imagine a coach who adjusts practice intensity based on how aligned a play is with the new strategy—rewarding helpful plays and easing off on extreme outliers. 🥬 The Concept (VESPO): Variational Sequence-Level Soft Policy Optimization is the algorithm that applies the closed-form kernel directly to whole responses. How it works: (1) compute W from sequence log-probabilities under current policy vs behavior; (2) compute advantage A with a baseline; (3) scale the policy gradient by φ(W) that’s smooth, sequence-level, and asymmetric for positive vs negative A; (4) implement everything in log-space for stability. Why it matters: it preserves inter-token dependencies, avoids length bias, stabilizes under staleness/asynchrony, and improves results across dense and MoE models. 🍞 Anchor: On-policy-like samples (W≈1) get full credit; wildly off-policy ones are softly toned down so they don’t derail training.
03 Methodology
At a high level: Input (queries x and behavior-policy responses y, plus rewards) → compute sequence-level importance weight W and advantage A → apply the VESPO kernel φ(W) to scale the gradient → update the policy.
Step-by-step recipe:
- Gather data (off-policy trajectories)
- What happens: Use a behavior policy μ (e.g., yesterday’s checkpoint or a different engine) to generate responses y for prompts x, and assign a sequence-level reward R(τ) per response.
- Why this step exists: Real systems split big rollouts into mini-batches and run asynchronously; so we naturally get off-policy data we want to reuse well.
- Example: For a math problem, the model writes a 4,000-token solution; a verifier checks if the final answer is correct and gives reward 1 or 0.
- Compute sequence-level importance weight W
- What happens: For each response, compute log W = Σ_t [log π_θ(y_t | x, y_<t) − log μ(y_t | x, y_<t)], then W = exp(log W). Do this in log-space for numerical stability.
- Why this step exists: W tells us how representative this trajectory is under the current policy vs the behavior policy; it’s the key to off-policy correction.
- What breaks without it: Gradients become biased toward μ; we might learn the wrong thing.
- Example: If the current policy thinks this exact solution is 4× more likely than μ did, then W≈4.
- Compute advantage A with a simple baseline
- What happens: Use A(τ) = R(τ) − b, where b is the mean reward within the prompt group (same as GRPO’s grouping). This centers rewards around zero.
- Why this step exists: Centering reduces variance and prevents the model from overreacting to uniform shifts.
- What breaks without it: Unnecessary noise; gradients point in unhelpful directions.
- Example: If most solutions score 0.3 on average and this one scored 1.0, then A=0.7.
- Apply VESPO’s smooth sequence-level kernel
- What happens: Use φ(W) = W^c1 * exp(c2 * (1 − W)), with φ(1)=1. Choose asymmetric (c1, c2) for A≥0 vs A<0 (e.g., stronger suppression for negative A when W<1).
- Why this step exists: It softly reduces extreme weights (both huge and tiny), but keeps on-policy-like samples strong; no length normalization is used.
- What breaks without it: (a) With hard clipping, updates jump at boundaries; (b) with length normalization, long responses can dominate and cause collapse; (c) with token-only tricks, we lose sequence structure.
- Example: If W=0.2 and A<0, the power term W^c1 naturally down-weights it; if W=10 and A>0, the exponential exp(c2*(1−W)) gently pulls it back.
- Form the policy gradient and update
- What happens: Use REINFORCE style: gradient ∝ φ(W) · A · ∇ log πθ(τ). Detach φ(W) so it just scales the gradient; compute in log-space to avoid overflow.
- Why this step exists: It gives a stable, variance-reduced update that preserves whole-sequence credit assignment.
- What breaks without it: Extreme gradients from rare trajectories can dominate; learning becomes unstable.
- Example: Two responses with the same A but wildly different W won’t both dominate—VESPO softly rebalances them.
- Implementation details that matter (the secret sauce)
- Log-space everything: Sum per-token log-probs; exponentiate only at the end to get W; also compute the kernel’s log form first.
- Sequence-level, no length normalization: Preserves inter-token dependencies and avoids length bias (no 1/T or sqrt(T)).
- Asymmetric hyperparameters: Slightly stronger decay for negative-advantage samples (A<0) with W<1 prevents over-penalizing what the policy already dislikes.
- Compatible with engineering fixes: Plays well with routing replay (R2) or truncated IS if you need them, but often stable even without.
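Putting the recipe together, a hedged single-trajectory sketch (the function name and the asymmetric (c1, c2) defaults are ours, not the paper's settings):

```python
import math

def vespo_scale(logp_pi, logp_mu, reward, baseline,
                pos=(0.5, 0.5), neg=(1.0, 0.5)):
    """Return (advantage, phi(W)) for one trajectory.
    The policy gradient is then phi(W) * A * grad log pi(tau),
    with phi(W) treated as a detached constant."""
    log_w = sum(p - m for p, m in zip(logp_pi, logp_mu))  # log-space sum
    w = math.exp(log_w)
    advantage = reward - baseline            # group-mean baseline, as in GRPO
    c1, c2 = pos if advantage >= 0 else neg  # asymmetric per advantage sign
    phi = w ** c1 * math.exp(c2 * (1.0 - w))
    return advantage, phi

# On-policy sample (identical log-probs): W = 1, phi = 1, full credit.
a, phi = vespo_scale([-1.0, -2.0], [-1.0, -2.0], reward=1.0, baseline=0.3)
```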
Concrete mini-example with numbers:
- Suppose W=1.0 (on-policy), A=+0.5. With φ(1)=1, the update strength is unchanged: good samples get full credit.
- Suppose W=7.0, A=+0.5. The exponential part exp(c2*(1−7)) shrinks the gradient smoothly—no hard cutoff—so one rare, huge-weight sample can’t hijack training.
- Suppose W=0.1, A=−0.4. The power part W^c1 (with larger c1 for negative A) down-weights it, preventing the model from overreacting to something it already thinks is unlikely.
What makes this method clever:
- It reframes weight reshaping as choosing a proposal distribution Q and solves for the best φ(W) with a principled variational objective.
- It removes the need for length normalization—avoiding a known source of length bias and instability—yet still squashes variance.
- It is smooth everywhere, preventing brittle behavior around clip thresholds, and it keeps the whole-sequence story intact.
04 Experiments & Results
The tests: The authors stress-test stability and performance where off-policy shift really happens.
- Policy staleness: Split a big rollout into N mini-batches (gbs/mbs=N) so later updates use staler data. They test N in {4, 8, 16, 32, 64}.
- Fully asynchronous: Rollout and training run on separate node groups; parameters sync every 4 local updates; in-flight stale rollouts are kept.
- Train–inference mismatch: Different engines (e.g., FSDP/Megatron for training vs vLLM for serving) and, for MoE models, routing divergence.
Benchmarks and metrics:
- Datasets: AIME 2024, AIME 2025, AMC 2023, MATH-500 for math reasoning; report avg@k accuracy (k=32 for AIME, k=16 for AMC 2023, k=4 for MATH-500).
- Models: Llama-3.2-3B-Instruct (dense), Qwen3-8B-Base (dense), Qwen3-30B-A3B-Base (MoE).
- Baselines: GRPO (token-level clipping), GSPO (sequence-level with length normalization and clipping), SAPO (soft adaptive gating). VESPO uses asymmetric (c1, c2) per sign of A.
Scoreboard highlights (context, not just numbers):
- With gbs/mbs=8, VESPO achieves the best average accuracy across all three models. On the challenging MoE model Qwen3-30B-A3B-Base, it reaches about 66.9% average accuracy (roughly a solid A), while the next-best baselines sit near 54.9–55.3% (a mid B), a clear and practical margin for users.
- Robustness to staleness: As N grows from 4 to 64, VESPO’s training reward curves stack almost on top of each other and final accuracies stay strong. In contrast, GRPO saturates earlier and weakens with higher N; GSPO degrades with N and even collapses at some settings; SAPO becomes unstable and can fully collapse as N increases.
- Fully asynchronous: VESPO keeps rollout KL, log perplexity, policy gradient loss, and gradient norms calm and near zero variance, while reaching the highest training reward and top AIME scores. SAPO collapses; GRPO shows spiky, unstable dynamics; GSPO is stable but underperforms.
- Train–inference mismatch: VESPO, even without special fixes, matches or beats GRPO variants that add Truncated IS or Router Replay (R2). When you combine VESPO with R2, you get the best of both worlds—the strongest training reward and best AIME25 accuracy in these tests.
Surprising findings:
- Skipping length normalization is better: Adding sqrt(T) or T normalization to VESPO hurts stability and can cause collapse. The pure sequence-level kernel (no length norm) is the most stable and achieves the highest reward.
- Asymmetry helps: Using stronger suppression for negative-advantage samples avoids over-penalizing unlikely sequences and prevents length explosion.
- MoE friendliness: VESPO’s soft, sequence-aware shaping is particularly effective where routing mismatches are common—exactly where many methods struggle.
Bottom line: Across staleness up to 64×, fully asynchronous pipelines, and mismatched engines, VESPO is the most stable approach and delivers the best or tied-best accuracy—especially on the hard MoE model where stability matters most.
05 Discussion & Limitations
Limitations:
- Hyperparameter tuning: The (c1, c2) settings, along with asymmetry between positive and negative advantages, matter. While VESPO is robust to moderate changes, extreme or poorly chosen values can slow learning or reintroduce instability.
- Reward design: As a policy-gradient method with sequence-level rewards, its performance depends on having reliable, low-noise rewards (e.g., math verifiers). Very noisy rewards may still challenge stability.
- Domain coverage: Experiments focus on mathematical reasoning; generalization to other domains (e.g., dialogue safety, long-form writing) should be validated.
- Theoretical knobs vs practical choices: The variational derivation yields a kernel family; picking exact parameters per-task is still empirical.
Required resources:
- Hardware: Experiments used 32 NVIDIA H20 GPUs, long contexts (up to 16k tokens), and frameworks like vLLM for inference and FSDP/Megatron for training.
- Logging and recomputation: You need per-token log-probabilities under both current and behavior policies (stored or recomputed) to form W in log-space.
- Infrastructure: For asynchronous tests, separate rollout and training clusters plus synchronization logic are required.
When not to use:
- Near-on-policy, short-sequence tasks where simple PPO/GRPO already trains stably and cheaply.
- Extremely noisy or adversarial reward settings where any policy-gradient method struggles without additional variance reduction or critics.
- Very tiny datasets where added complexity may not justify benefits.
Open questions:
- Adaptive schedules: Can we auto-tune (c1, c2) or learn them from variance targets (e.g., maintain a desired effective sample size)?
- Value functions: How does VESPO interact with learned critics or variance-reduced baselines beyond group means?
- Broader domains: Will the same stability gains appear in code generation, multi-turn agents with tools, or safety alignment tasks?
- Offline RL: Can the measure-change view and soft kernel extend to purely offline datasets with severe distribution shift?
- Theory: Can we turn the variance constraint into measurable, per-batch guarantees or adaptive Lagrange multipliers in practice?
06 Conclusion & Future Work
Three-sentence summary: VESPO reframes importance-weight reshaping as choosing a proposal distribution and solves it with a variational objective that balances closeness to behavior and target policies under a variance constraint. This yields a smooth, closed-form, sequence-level kernel that avoids length normalization while softly suppressing extreme weights. In challenging settings—high staleness, asynchronous pipelines, train–inference mismatch, and MoE routing—VESPO trains stably and outperforms strong baselines on math reasoning.
Main achievement: A principled, closed-form, sequence-level soft policy optimization kernel (approximately φ(W)=W^c1·exp(c2·(1−W))) that stabilizes off-policy LLM RL without length normalization, preserving inter-token dependencies.
Future directions: Scale to larger asynchronous clusters; extend to multi-turn agent RL with tools; integrate with adaptive hyperparameter schedules or variance targets; combine with critics and routing-replay; apply to on-policy distillation and offline RL.
Why remember this: It gives a unifying, theory-backed lens—“reshaping = measure change”—that explains past heuristics and provides a reliable, smooth alternative. In practice, it makes real-world RL for LLMs more robust and scalable, especially where off-policy shift and mismatches are unavoidable.
Practical Applications
- •Stabilize large-batch, mini-batch-split RL training where later mini-batches are increasingly off-policy.
- •Run fully asynchronous RL pipelines (separate rollout and training clusters) without throwing away stale data.
- •Train MoE models with reduced risk of collapse from routing mismatches by softly suppressing extreme weights.
- •Replace token-level clipping with sequence-level soft shaping to preserve inter-token dependencies.
- •Avoid length normalization and its bias, preventing runaway long outputs and instability.
- •Combine VESPO with router replay (R2) to further increase stability and peak accuracy on MoE systems.
- •Implement log-space computation of W and φ(W) to prevent numerical overflow in long-sequence training.
- •Use asymmetric hyperparameters to guard against over-penalizing negative-advantage samples with W<1.
- •Monitor rollout KL, log-perplexity, gradient norms, and response length; VESPO should keep them steady.
- •Apply the measure-change view to design new weight-shaping variants tailored to your reward noise profile.