Online Causal Kalman Filtering for Stable and Effective Policy Optimization
Key Summary
- Training big language models with reinforcement learning can wobble because the per-token importance-sampling (IS) ratios swing wildly.
- Old fixes either flatten everything with one sequence-wide ratio or tweak each token alone, ignoring how neighboring tokens are related over time.
- This paper models the 'right' IS ratio as a hidden value that gently changes from token to token and tracks it with an online, causal Kalman filter.
- The filter smooths noisy spikes but still keeps meaningful local ups and downs across nearby tokens.
- Using these filtered ratios (KPO) makes policy updates steadier and more effective, especially in long math reasoning.
- Across six tough math benchmarks, KPO beats strong methods like GRPO, GSPO, and GMPO on most scores.
- Clipping plus Kalman filtering (KPO-clipped) is usually best, balancing stability and learning strength.
- Analysis shows fewer chaotic token switches, longer coherent runs, and much more low-frequency (smooth) signal after filtering.
- KPO is lightweight to add but processes tokens sequentially, which is hard to parallelize.
- This approach points toward safer, more reliable reinforcement learning for reasoning-heavy LLMs.
Why This Research Matters
Stable reinforcement learning makes large language models more dependable in the tasks we care about most, like multi-step math, coding, and planning. By smoothing noisy token weights while keeping local structure, KPO helps models maintain consistent chains of thought instead of wobbling from word to word. This can translate into fewer failures, better safety, and more trustworthy outputs in day-to-day apps. Developers benefit from faster convergence and less risk of training collapse, saving time and compute. Users benefit from assistants that reason more clearly and make fewer strange jumps. Over time, this kind of stability-first thinking can improve not just math solvers, but any system where step-by-step logic matters.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine reading a long math solution out loud. If you shout and whisper at random on every single word, the whole thing feels chaotic. But if your loudness changes smoothly, people can follow your thinking.
🥬 The Concept: Before this paper, training large language models (LLMs) with reinforcement learning (RL) often felt like that chaotic reading. The per-token importance-sampling (IS) ratios—the weights that tell learning how much to trust each token—could spike and crash from one token to the next, making training unstable.
How it worked before:
- Reinforcement learning for LLMs (like PPO/GRPO) updates a policy using tokens from an older policy and corrects for the mismatch using an IS ratio per token.
- In practice, these per-token ratios can have huge variance: some tokens get overly loud (very high ratio), others too quiet (very low), making gradients jumpy and training brittle.
- To calm things down, people tried two main fixes: (a) replace all token ratios with one smooth, sequence-level ratio (GSPO/GMPO), or (b) adjust each token’s ratio independently with soft gates or asymmetric handling.
Why it matters: When the IS ratios are too noisy, gradients become erratic, entropy can collapse, and training can even fail. That means worse reasoning, more cost, and less reliable AI.
🍞 Anchor: Think of a class reading a book. If every child randomly shouts or whispers per word, the teacher can’t track who’s reading well. But if there’s a calm, fair way to adjust everyone’s volume based on recent behavior, the class stays in sync and improves together.
—
🍞 Hook: You know how friends walking together naturally match pace for a while before changing speed? Neighboring steps are related; they don’t usually lurch every second.
🥬 The Concept: Off-policy deviation—the difference between the new policy and the old one—should also show local smoothness across adjacent tokens within the same reasoning segment. But the paper discovers that raw token-level IS ratios are structurally inconsistent: lots of short runs, frequent switches, and growing off-policy spikes later in sequences.
How it works (the problem):
- Measure per-token IS ratios during GRPO training.
- Count how often tokens are off-policy, how long off-policy runs last, and how frequently the state switches.
- Find: off-policy tokens are short-lived, switch often, and occur more later in the sequence—suggesting noisy, weak local coherence.
Why it matters: If adjacent tokens don’t carry coherent weights, the policy update can push and pull in conflicting directions, distorting learning and risking collapse.
🍞 Anchor: If one friend sprints for one step, then stops for one step, then sprints again, the group gets tangled. A stable walk needs short-term coherence, not wild step-by-step swings.
—
🍞 Hook: Imagine two common band-aids: (1) average everyone’s volume into one number (smooth but loses detail), or (2) fix each person separately (detailed but ignores group flow).
🥬 The Concept: Sequence-level ratios give global smoothness but erase local structure; token-by-token tweaks keep detail but ignore time structure. The missing piece is a method that’s both smooth and locally structure-aware.
How it works (the gap):
- One global number per sequence reduces variance but treats all tokens the same.
- Per-token gates or flips treat tokens in isolation, losing how neighbors relate.
- We need a method that respects the fact that nearby tokens often belong to the same reasoning step.
Why it matters: Without local structure awareness, we either over-smooth and ignore important differences or under-smooth and keep the chaos.
🍞 Anchor: It’s like mixing paint. Stirring everything until uniform erases the picture; poking random dots doesn’t create a pattern. You need smooth brushstrokes that follow the shapes.
—
🍞 Hook: Picture a speedometer that jitters wildly on a bumpy road. A good dashboard filters the noise so the needle moves smoothly but still reacts to real speed changes.
🥬 The Concept: The paper fills the gap with KPO—an online, causal Kalman filter that treats the desired IS ratio as a hidden state that drifts gently across tokens and is observed through noisy measurements.
How it works:
- Model the true (latent) log-IS ratio as smoothly evolving over tokens.
- Treat the raw log-IS ratio as a noisy observation of that true value.
- Use an online, past-only (causal) Kalman filter to update the estimate each token: predict, compute adaptive gain, then update.
Why it matters: This preserves local structure (neighbors inform each other) while crushing noise spikes, leading to steadier, more effective training.
🍞 Anchor: It’s like having a smart cruise control that listens to recent speed and road feel, then adjusts smoothly, keeping the car stable without lagging behind real changes.
—
🍞 Hook: Why should anyone care? Because wobbly training means worse answers in everyday tasks—math homework help, coding hints, or planning steps for a project.
🥬 The Concept: Stable policy optimization isn’t just a math trick; it makes LLMs more reliable in long, multi-step reasoning where small mistakes can snowball.
How it works in life:
- In math: fewer random jumps means more consistent chains of thought.
- In coding: steadier updates help the model follow logical steps without derailing.
- In agents: consistent policies avoid risky overreactions from one moment to the next.
Why it matters: Better stability means better results, safer behavior, and more trust in AI systems we use daily.
🍞 Anchor: Like building a tower of blocks, if your hand shakes wildly for each block, the tower falls. If your hand is steady but responsive, you can stack higher and safer.
02 Core Idea
🍞 Hook: You know how a best friend can guess your next word because you’ve been talking about the same topic for a while? Conversations have short-term flow.
🥬 The Concept: The key insight—model the per-token importance-sampling ratio as a hidden, smoothly changing signal over time and estimate it online with a causal Kalman filter that only uses past and current tokens.
How it works:
- Think in log space so very big or small ratios don’t blow up numerically.
- Predict the next latent ratio from the last one (it usually changes slowly).
- Compare the noisy observation (raw log-ratio) to the prediction.
- Use an adaptive gain to decide how much to trust the new observation.
- Update the estimate and move on to the next token.
Why it matters: Without this, token weights can whiplash—too much push here, sudden pull there—hurting stability. With it, updates respect local structure but ignore jitter.
🍞 Anchor: It’s like smoothing a shaky video: the scene still changes, but you remove the hand tremors so viewers don’t get dizzy.
—
Three analogies to deepen the idea:
- Thermostat analogy
- 🍞 Hook: Imagine a smart thermostat that doesn’t blast heat at every tiny temperature wobble.
- 🥬 The Concept: The latent ratio is your true room temperature; the raw ratio is a noisy thermometer reading. The Kalman filter fuses yesterday’s estimate with today’s noisy reading to decide how much to adjust.
- 🍞 Anchor: On a windy day, the thermometer flickers, but your room stays comfy because the filter trusts the trend, not each flicker.
- Cruise control analogy
- 🍞 Hook: Cruise control smooths your speed even if the road bumps your car.
- 🥬 The Concept: The car’s true speed changes slowly; the speedometer can jitter. The filter updates your speed estimate gently, so acceleration is steady, not jumpy.
- 🍞 Anchor: You don’t jerk the gas pedal with every bump; you glide.
- Word-highlighter analogy
- 🍞 Hook: When reading, your eyes linger a bit on clusters of important words.
- 🥬 The Concept: Neighboring tokens often share the same reasoning step, so their ratios should move together. The filter highlights coherent stretches instead of random single-word pops.
- 🍞 Anchor: In “Find the capital of France,” attention clusters on “capital” and “France,” not on a random “the.”
—
Before vs. After:
- Before: Raw token ratios are jumpy; sequence-wide averages over-smooth and erase detail; per-token gates miss time structure.
- After: Filtered token ratios are smooth but keep local up/down patterns, leading to steadier gradients, preserved entropy, and higher scores.
Why it works (intuition):
- Most real changes in off-policy behavior happen over short stretches (reasoning steps), not every single token.
- The Kalman filter’s adaptive gain automatically trusts the observation more when the model is uncertain and the observation seems reliable—and trusts history more when the observation looks noisy.
- Working in log space naturally balances big and small ratios and prevents multiplicative blow-ups.
Building Blocks:
- 🍞 Hook: You know how detectives combine clues and prior knowledge?
- 🥬 The Concept: KPO combines a few simple pieces:
- Latent state (the smooth, true log-IS ratio).
- Process noise Q (how much the latent can wiggle per token).
- Observation noise V (how noisy the raw ratio is).
- Kalman gain (how much to adjust toward the new reading).
- Causality (only past/current tokens are used—just like autoregressive generation).
- Why it matters: Each piece prevents a specific failure—no state means no structure, no Q means no adaptation, no V means over-trusting noise, no gain means no balance, no causality means peeking into the future (cheating).
- 🍞 Anchor: It’s like making a smoothie: the fruit (observations), yogurt (prior), blender speed (gain), and thickness (Q/V) must be balanced for a tasty, consistent result.
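These building blocks fit into a single scalar update. Here is a minimal sketch of one filter step (the function name and variable names are ours, not the paper's):

```python
def kalman_step(rho_prev, P_prev, z_t, Q=1e-4, V=1.0):
    """One causal Kalman update for the latent log-IS ratio.

    rho_prev, P_prev: previous state estimate and its uncertainty.
    z_t: noisy observation (the raw log-IS ratio at token t).
    Q: process noise -- how much the latent may wiggle per token.
    V: observation noise -- how noisy the raw ratio is.
    """
    # Predict: the latent drifts slowly, so the prediction is the old
    # estimate, while uncertainty grows by Q.
    rho_pred = rho_prev
    P_pred = P_prev + Q
    # Adaptive gain: trust the observation more when our uncertainty
    # (P_pred) is large relative to the observation noise (V).
    K = P_pred / (P_pred + V)
    # Update: move toward the observation by the gain-weighted surprise,
    # and shrink uncertainty accordingly.
    rho_new = rho_pred + K * (z_t - rho_pred)
    P_new = (1.0 - K) * P_pred
    return rho_new, P_new
```

Causality comes for free: the function only ever sees the previous state and the current observation, never future tokens.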
03 Methodology
At a high level: Input → compute raw per-token log-ratios → causal Kalman filter (predict → gain → update) → exponentiate to get filtered ratios → plug into GRPO-style objective (clipped or not) → output stable policy updates.
Step-by-step recipe:
- Input and preprocessing
- What happens: For each sequence, compute the raw token-level log-IS ratio z_t = log π_new(y_t | context) − log π_old(y_t | context) for t = 1...T.
- Why this step exists: Log space turns products into sums and avoids numerical blow-ups when ratios are very large/small.
- Example: Suppose three tokens have raw probabilities: old policy [0.20, 0.10, 0.05], new policy [0.22, 0.08, 0.07]. Then ratios r=[1.10, 0.80, 1.40], and log-ratios z≈[0.095, −0.223, 0.336].
- Initialize the filter
- What happens: Set the initial latent estimate ρ̂_0|0 = 0 (ratio≈1) and uncertainty P_0|0 to a small positive value. Choose process noise Q and observation noise V.
- Why this step exists: The filter needs a starting guess and a knob (Q/V) to trade off smoothness vs. responsiveness.
- Example: Use Q=1e−6, V=1 for clipped KPO (very smooth); Q=1e−4, V=1 for unclipped KPO (a bit more responsive).
- Prediction (for each token t)
- What happens: Assume the latent log-ratio drifts slowly: predict ρ̂_t|t−1 = ρ̂_{t−1|t−1}. Increase uncertainty: P_t|t−1 = P_{t−1|t−1} + Q.
- Why this step exists: Encodes the belief that true off-policy behavior doesn’t flip wildly at each token but can change gradually.
- Example: If the last estimate was 0.05 with small uncertainty, the next prediction stays near 0.05 but with slightly more uncertainty.
- Compute the innovation and Kalman gain
- What happens: Innovation δ_t = z_t − ρ̂_t|t−1 measures surprise; gain K_t = P_t|t−1 / (P_t|t−1 + V) sets how strongly to trust z_t.
- Why this step exists: The gain is the secret sauce: if the observation looks reliable (low V, high P), trust it more; if it looks noisy, trust history more.
- Example: If δ_t is large but V is also large, K_t stays small, so you don’t chase likely noise spikes.
- Update the estimate
- What happens: Correct the prediction: ρ̂_t|t = ρ̂_t|t−1 + K_t δ_t. Reduce uncertainty: P_t|t = (1 − K_t) P_t|t−1.
- Why this step exists: This fuses the past (prediction) with the present (observation) in an adaptive, mathematically optimal way (for Gaussian assumptions).
- Example: If z_t ≈ 0.33, ρ̂_t|t−1 ≈ 0.05, and K_t ≈ 0.2, then ρ̂_t|t ≈ 0.05 + 0.2*(0.28) ≈ 0.106—moved toward the observation but not all the way.
- Map back to ratio space
- What happens: Set the filtered ratio ê_t = exp(ρ̂_t|t).
- Why this step exists: Training objectives use ratios, not logs; exponentiation converts the smooth log estimate into a positive ratio.
- Example: If ρ̂_t|t = 0.106, ê_t ≈ 1.11.
- Plug into the policy loss
- What happens: Replace raw r_t with filtered ê_t in a GRPO-style objective. Use either:
- Clipped variant: min(ê_t A_t, clip(ê_t, 1 − ε_low, 1 + ε_high) A_t).
- Unclipped variant: simply ê_t A_t.
- Why this step exists: The filter reduces noise; clipping further tames rare extremes. Together, they steady gradients while keeping fine-grained credit assignment.
- Example: If ê_t=1.11 and A_t positive, the update nudges the token probability up; if ê_t were very large, clipping limits the push.
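The full recipe above can be sketched end to end. This is a minimal, illustrative implementation, not the paper's code; the function names and the clipping thresholds `eps_low`/`eps_high` are our own:

```python
import math

def kpo_filtered_ratios(logp_new, logp_old, Q=1e-6, V=1.0, P0=1e-2):
    """Causal Kalman filtering of per-token IS ratios (sketch).

    logp_new / logp_old: per-token log-probs under the new and old policy.
    Returns the filtered ratio exp(rho_hat_t) for each token.
    """
    rho, P = 0.0, P0              # start at log-ratio 0, i.e. ratio 1
    ratios = []
    for ln, lo in zip(logp_new, logp_old):
        z = ln - lo               # raw log-IS ratio (noisy observation)
        P_pred = P + Q            # predict: uncertainty grows by Q
        K = P_pred / (P_pred + V)     # adaptive gain
        rho = rho + K * (z - rho)     # fuse prediction with observation
        P = (1.0 - K) * P_pred
        ratios.append(math.exp(rho))  # map back to ratio space
    return ratios

def clipped_token_loss(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO/GRPO-style clipped surrogate for one token, on the filtered ratio."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

Feeding in the three-token probabilities from the earlier example (old [0.20, 0.10, 0.05], new [0.22, 0.08, 0.07]) yields filtered ratios that hover near 1, in contrast to the raw ratios [1.10, 0.80, 1.40].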
The secret sauce:
- Causality: Only past/current tokens are used, matching autoregressive generation and avoiding lookahead bias.
- Local structure awareness: Neighboring tokens inform each other, creating long coherent segments rather than ping-ponging states.
- Adaptive gain: The filter automatically dials up or down how much to trust observations, depending on noise.
- Log-space stability: Prevents multiplicative explosions and treats big/small ratios symmetrically.
Concrete mini example (3 tokens):
- Raw log-ratios z: [0.095, −0.223, 0.336]. Start ρ̂_0|0=0.
- t=1: predict 0 → observe 0.095; modest K gives ρ̂_1|1 ≈ 0.06 → ê_1 ≈ 1.06.
- t=2: predict 0.06 → observe −0.223; K nudges down: ρ̂_2|2 ≈ −0.01 → ê_2 ≈ 0.99.
- t=3: predict −0.01 → observe 0.336; K moves up: ρ̂_3|3 ≈ 0.05 → ê_3 ≈ 1.05.
Result: smooth, coherent ratios around 1, not wild swings.
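The mini example can be run directly. The gain values in the walkthrough above are schematic, so the settings below (P0, Q, V chosen for a visible but modest gain) give slightly different numbers with the same qualitative behavior:

```python
import math

# Raw log-ratios from the three-token walkthrough.
z = [0.095, -0.223, 0.336]

# Illustrative filter settings (not the paper's hyperparameters).
rho, P, Q, V = 0.0, 1.0, 0.01, 1.0
filtered = []
for z_t in z:
    P_pred = P + Q                 # predict: uncertainty grows
    K = P_pred / (P_pred + V)      # adaptive gain
    rho = rho + K * (z_t - rho)    # fuse with the noisy observation
    P = (1.0 - K) * P_pred
    filtered.append(math.exp(rho))
# The filtered ratios hover near 1 instead of swinging between 0.80 and 1.40.
```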
What breaks without each step:
- No log transform: numeric instability on extremes.
- No prediction: ignores local continuity; becomes jittery.
- No gain: either over-trusts noise or never adapts.
- No update: never learns from data.
- No clipping (when used): rare extremes can still dominate.
Putting it together: The pipeline turns noisy, high-variance token weights into calm, structure-aware signals that make policy optimization stable and effective.
04 Experiments & Results
🍞 Hook: Imagine a spelling bee where each student gets 16 tries to spell a word. We track both how many words they ever spell right (pass@16) and how often they spell right on average (avg@16). That gives a fair picture of skill and consistency.
🥬 The Test: The authors train Qwen3-4B on tough math reasoning and evaluate on six benchmarks (AIME’24, AIME’25, AMC’23, MATH500, Minerva, OlympiadBench). For each problem, the model produces 16 solutions. Metrics are:
- pass@16: at least one correct among 16 attempts.
- avg@16: average correctness across all 16 attempts (measures consistency).
Why it matters: pass@16 is like managing to get it right eventually; avg@16 is about being reliably right.
🍞 Anchor: If a student sometimes gets it right only after many tries, pass@16 can be okay, but avg@16 reveals they’re still inconsistent. We want both high.
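As a quick sketch of how the two metrics differ (a toy scorer, not the paper's evaluation harness):

```python
def benchmark_scores(results):
    """results: one list of booleans per problem (True = attempt correct).

    Returns (pass@k, avg@k) as percentages, averaged over problems.
    """
    n = len(results)
    # pass@k: did at least one of the k attempts succeed?
    pass_k = 100.0 * sum(1 for r in results if any(r)) / n
    # avg@k: what fraction of the k attempts succeeded, on average?
    avg_k = 100.0 * sum(sum(r) / len(r) for r in results) / n
    return pass_k, avg_k
```

A student who gets a problem right once in 16 tries scores 100 on pass@16 for that problem but only 6.25 on avg@16, which is exactly the gap between "eventually right" and "reliably right."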
—
The competition (baselines):
- GRPO: token-level ratios (no special smoothing).
- GSPO/GMPO: flatten to sequence-level ratios (very smooth but lose local detail).
- KPO (ours): token-level Kalman-filtered ratios, with unclipped and clipped variants.
Scoreboard highlights (higher is better):
- AIME’24: GSPO avg@16=32.70; KPO-clipped=37.91 (a clear jump), KPO-unclipped pass@16=66.67 (top).
- AIME’25: GSPO avg@16=29.16; KPO-clipped=36.87, pass@16=60.00 (both best).
- AMC’23: GSPO pass@16=95.00 (tied with KPO); KPO-clipped avg@16=87.50 (best), showing greater consistency.
- MATH500: KPO-clipped avg@16=89.42 and pass@16=94.80 (best).
- Olympiad: KPO-clipped avg@16=54.06 and pass@16=66.27 (best).
- Minerva: KPO-unclipped avg@16=39.15 (best); GMPO is slightly higher on pass@16.
Context: these gains are like moving from a B to an A range, especially on the hardest tests (AIME/Olympiad), where stability matters most.
Surprising findings:
- KPO-clipped usually beats KPO-unclipped: Even after smoothing ratios, occasional truly extreme tokens exist; clipping safely limits their impact while keeping filtered structure.
- Training stability: Plots show KPO’s reward rising steadily, entropy staying healthy (avoids early collapse), moderate clip fractions, and low-variance policy loss. GRPO shows instability; GSPO/GMPO are stable but plateau earlier.
- Token dynamics transform: After filtering, tokens form long coherent runs of the same state (up-weighted, down-weighted, or on-policy), switch far less often (~0.43 → ~0.01), and the signal becomes almost entirely low-frequency (LFR ~0.98). That’s evidence the filter removed jitter while preserving meaningful trends.
- Tuning Q/V matters: Stronger smoothing (smaller Q/V like 1e−6) yields higher rewards and stable clip fractions. Too weak smoothing (1e−2) lets noise leak back, hurting performance.
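The switch-frequency and low-frequency-ratio (LFR) diagnostics can be approximated with simple signal statistics. The definitions below are our illustrative versions (a naive DFT energy split); the paper's exact measurement may differ:

```python
import cmath

def switch_rate(states):
    """Fraction of adjacent token pairs whose off-policy state differs."""
    return sum(a != b for a, b in zip(states, states[1:])) / (len(states) - 1)

def low_frequency_ratio(signal, cutoff_frac=0.1):
    """Share of spectral energy in the lowest `cutoff_frac` of frequencies
    (DC excluded), computed with a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    energies = []
    for k in range(1, n // 2 + 1):
        coef = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                   for t, x in enumerate(centered))
        energies.append(abs(coef) ** 2)
    cutoff = max(1, int(len(energies) * cutoff_frac))
    total = sum(energies) or 1.0
    return sum(energies[:cutoff]) / total
```

A smooth sinusoid scores an LFR near 1, while a token-by-token alternating signal scores near 0, matching the intuition that filtering moves energy from jitter to trend.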
Why the results matter:
- On long, multi-step reasoning, small instabilities multiply. KPO’s local structure-aware smoothing cuts variance where it hurts most, letting the model learn deeper chains of thought.
- Keeping per-token detail (vs. full sequence flattening) retains fine-grained credit assignment, which helps consistency (avg@16) while also boosting best-case success (pass@16) on many tasks.
Takeaway: Across tough math benchmarks, KPO delivers both steadier training and better final scores than popular baselines, especially when paired with clipping for a robust bias–variance trade-off.
05 Discussion & Limitations
Limitations:
- Autoregressive filtering: The Kalman update is sequential by design, so it can’t be parallelized across tokens the way attention or per-token loss computations can. This may limit speed for very long sequences.
- Hyperparameter sensitivity: Choosing Q and V (process and observation noise) affects the smoothness–responsiveness balance. Poor tuning can under-smooth (noisy) or over-smooth (laggy) the ratios.
- Residual extremes: Filtering reduces spikes but doesn’t delete meaningful extremes; clipping is still helpful in practice.
- Modeling simplicity: A 1D random walk is simple and robust, but it can miss richer structures (e.g., segment boundaries or semantic shifts) that more complex state-space models might capture.
Required resources:
- Standard RLHF/GRPO training setup with access to old and new policy token log-probabilities.
- Minimal extra compute for the per-token scalar filter (tiny memory and compute overhead).
- Some time to tune Q/V and clipping thresholds per domain/model.
When NOT to use:
- If your training is already fully on-policy (ratios near 1 and stable), filtering adds little benefit.
- If per-token structure truly isn’t locally coherent (e.g., highly erratic, intentionally randomized tokenization), the temporal assumption breaks.
- If you require strict parallelization across tokens for speed constraints and can’t afford any per-token sequential step.
Open questions:
- Can we learn Q and V online from data or make them token/context dependent (e.g., higher Q in regions with known MoE routing changes)?
- Could segment-aware or content-aware state-space models further improve local coherence without lag?
- Is there a parallelizable approximation of causal filtering that preserves most benefits?
- How does KPO interact with other stabilizers (e.g., dynamic clipping, uncertainty-aware reweighting) across broader tasks like coding, dialogue safety, or tool-use agents?
- Can we extend from 1D ratios to small vector states that encode additional local signals (e.g., advantage trends, entropy signals) for even smarter smoothing?
06 Conclusion & Future Work
Three-sentence summary:
- The paper shows that raw, per-token importance-sampling ratios in RL for LLMs are locally inconsistent and noisy, which destabilizes training.
- It proposes KPO, an online, causal Kalman filter that treats the desired log-IS ratio as a smoothly evolving hidden state, producing token-wise ratios that are both smoothed and locally coherent.
- Across challenging math benchmarks, KPO improves stability and performance over strong baselines, especially when combined with clipping.
Main achievement:
- Turning IS-ratio handling from a per-token or per-sequence static choice into a time-aware, locally coherent estimation problem solved by causal Kalman filtering.
Future directions:
- Learn or adapt Q/V on the fly; add segment/semantic awareness; explore parallel-friendly approximations; integrate with uncertainty-driven or dynamic clipping methods; extend beyond math reasoning to agents and code.
Why remember this:
- KPO reframes a noisy training headache as a simple signal-tracking problem. By respecting time and local structure, it steadies gradients without erasing detail—an idea likely to generalize wherever token-level decisions unfold over coherent stretches.
Practical Applications
- Train math-reasoning LLMs that keep coherent chains of thought without collapsing entropy.
- Stabilize RLHF for long-form reasoning tasks (proof-style problems, multi-step derivations).
- Improve consistency in code generation by smoothing token-level updates during RL fine-tuning.
- Reduce instability in Mixture-of-Experts models where routing can cause ratio spikes.
- Enhance agent planning and tool-use sequences by preserving local structure across action tokens.
- Mitigate off-policy variance in mini-batch updates for scalable RL training pipelines.
- Pair with clipping to safely limit rare extremes while retaining fine-grained credit assignment.
- Tune Q/V for domains with different noise (e.g., higher Q in fast-changing MoE scenarios).
- Adopt KPO as a drop-in ratio smoother for GRPO-style objectives with minimal overhead.
- Use filtered ratios to avoid entropy collapse and maintain exploration in long horizons.