
Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Intermediate
Kirill Pavlenko, Alexander Golubev, Simon Karasik et al. · 2/10/2026
arXiv

Key Summary

  • The paper fixes a common mistake in training language models for multi-part tasks: giving the same reward signal to every token, even when different text parts aim at different goals.
  • It introduces Blockwise Advantage Estimation (BAE), which gives each text block (like 'solution' or 'confidence') its own learning signal and applies it only to the tokens that control that block.
  • A new Outcome-Conditioned Baseline (OCB) estimates fair reference scores for later blocks (like confidence) by grouping samples with the same intermediate outcome (like whether the solution was correct), without running extra expensive rollouts.
  • On math problems with both solution accuracy and confidence calibration, BAE+OCB reduces interference between goals and matches or beats strong reward-design methods on calibration, with competitive accuracy.
  • Compared to using a single, hand-tuned scalar reward (like RLCR), BAE avoids reward mixing and still learns useful, well-calibrated confidence.
  • OCB works using only the samples already in each GRPO group, keeping compute low and training stable.
  • BAE preserves test-time scaling benefits: when you sample multiple answers, confidence-weighted voting still boosts accuracy.
  • The approach is modular, scales naturally to more objectives, and also helps in a two-attempt refinement setting for math.
  • Limitations include needing enough samples per outcome group and clearly defined segment boundaries.
  • This is a practical recipe for multi-objective RL in structured generations, improving credit assignment without extra inference cost.

Why This Research Matters

When apps answer questions for people, we don’t just want them to be right—we want them to say how sure they are, and to learn the right lessons from each part of their response. This method teaches models to improve different skills (like solving and self-checking) without mixing signals and confusing them. Because it avoids extra expensive rollouts, it’s practical at scale for long answers and many objectives. Better calibration means safer systems that can flag uncertainty instead of bluffing, which is crucial for education, healthcare triage, and coding assistants. The approach is modular, so it naturally extends to more steps (like reflect-then-revise workflows) and very long tasks. In short, it helps build models that are both smarter and more trustworthy.

Detailed Explanation


01Background & Problem Definition

šŸž You know how a school report card has different subjects like math, reading, and science, and it wouldn’t be fair to give one single grade for everything? 🄬 The Concept: In training large language models (LLMs) with reinforcement learning (RL), many methods used to hand out one big score (a single scalar advantage) to an entire response, even when the response has different parts trying to do different jobs. How it works (old way): 1) The model writes a whole answer. 2) A reward function turns many goals into one number. 3) That one number is pushed back onto every token equally. Why it matters: This mixes signals from unrelated goals, so the model can get confused about which part did well or poorly, leading to credit assignment errors and goal interference. šŸž Anchor: If a math solution is good but the confidence statement is bad, or vice versa, the old method still pushes the same reward onto every token, punishing the good part and rewarding the bad part.

šŸž Imagine building a LEGO castle with rooms for different purposes: kitchen, bedroom, and library. You wouldn't judge all rooms by how well the kitchen cooks, right? 🄬 The Concept: Many modern LLM tasks are structured into clearly labeled parts (blocks), like a reasoning/answer block followed by a confidence block. How it works: 1) The prompt asks for structured output with tags like <answer> and <confidence>. 2) Each block has its own clear objective. 3) Rewards can be checked with verifiers (e.g., was the answer correct? does the confidence match correctness?). Why it matters: When you can tell which tokens control which goal, you don’t have to mix everything into one reward anymore. šŸž Anchor: In math with confidence, the answer block determines correctness; the confidence block should reflect uncertainty. They deserve different learning signals.

šŸž Think of a coach trying to improve a relay team by giving the same feedback to all runners regardless of leg. 🄬 The Concept: Group Relative Policy Optimization (GRPO) is a popular, critic-free RL method that normalizes rewards within a group of samples from the same prompt and applies a single advantage to all tokens. How it works: 1) Sample several responses for one prompt. 2) Score each response. 3) Compute a within-group mean (and sometimes variance). 4) Use the normalized score (advantage) for all tokens in each response. Why it matters: It’s simple and stable—but when tasks have multiple segments with separate goals, this single advantage can misattribute credit across segments. šŸž Anchor: If the answer is right but the confidence is overconfident, GRPO’s one-score fits-all update still nudges both answer and confidence tokens the same way.

šŸž You know how making one megascore from many goals (like taste, nutrition, and appearance in cooking) can be tricky? 🄬 The Concept: Reward scalarization mixes multiple objectives (accuracy, calibration, style) into one reward number. How it works: 1) Designers choose how much to weigh each sub-goal. 2) The model optimizes the weighted sum. 3) Tuning the weights is delicate; wrong weights cause trade-offs or even exploitation. Why it matters: It’s fragile and can lead to reward hacking or mode collapse, especially in complex, long responses. šŸž Anchor: RLCR works well but still mixes goals; changing one term can accidentally hurt another.

šŸž Picture pausing a video game midway to predict your final score from that point. 🄬 The Concept: Conditional value estimation asks, ā€œGiven what has happened so far (the prefix), what reward should I expect for the next block?ā€ How it works: 1) Later blocks depend on earlier sampled text. 2) A good baseline for a later block should reflect the state after the previous block. 3) The classic way to estimate it is to do Monte Carlo rollouts from that state—but that’s expensive. Why it matters: Without a good, state-matched baseline, advantages for later blocks are noisy or biased, slowing or misguiding learning. šŸž Anchor: Confidence reward depends on whether the solution was correct; the baseline for the confidence block should differ for correct vs. incorrect solutions.

šŸž Imagine wanting to simulate many possible futures from a checkpoint in a game to see the average outcome. 🄬 The Concept: Monte Carlo rollouts generate many continuations from an intermediate state to estimate expected rewards. How it works: 1) Freeze the prefix. 2) Sample many endings. 3) Average the rewards. Why it matters: It’s accurate but computationally expensive for long texts and many prompts. šŸž Anchor: For long math solutions, extra rollouts for every confidence block would be too costly at scale.

All this sets the stage: we want a method that (1) gives each block its own learning signal, (2) fairly compares later blocks to similar states, and (3) does not require extra expensive sampling. That is exactly what this paper delivers with Blockwise Advantage Estimation and its Outcome-Conditioned Baseline.

02Core Idea

šŸž Think of a report card where math, art, and PE each get their own grade, and only the art class updates your art skills. 🄬 The Concept: The key insight is to give each text block its own advantage, then apply that advantage only to the tokens in that block—using an outcome-conditioned baseline so later blocks get compared fairly to similar prefixes. How it works: 1) Split the completion into segments (blocks). 2) Compute a reward for each block (e.g., correctness for solution, calibration for confidence). 3) For block 1, use standard group normalization as the baseline. 4) For block k>1, compute a baseline by grouping samples that share the same intermediate outcome (like correct vs. incorrect solution). 5) Form per-block advantages (reward minus baseline) and apply them only to the block’s tokens in a PPO-style update. Why it matters: This reduces interference between goals, improves credit assignment, and avoids hand-tuning a single mixed reward. šŸž Anchor: In math+confidence, the solution block learns from correctness, and the confidence block learns from calibration relative to correct/incorrect strata.

Multiple analogies:

  • School subjects: Each subject has its own test and grade; practicing math shouldn’t change your art grade.
  • Relay race: Each runner (block) is judged on their segment, not on the entire race’s final time alone.
  • Multi-course meal: Appetizer, main, and dessert each have their own taste test; dessert shouldn’t be punished because the appetizer was salty.

Before vs. after:

  • Before: A single advantage per completion mixed signals from unrelated goals; later blocks lacked fair, state-matched baselines unless you paid for extra rollouts.
  • After: Each block gets its own advantage, routed only to its tokens; later blocks use outcome-conditioned baselines computed within the same group, no extra rollouts.

šŸž You know how, to judge a high-jump attempt, you compare it to attempts with the same conditions (like wind direction)? 🄬 The Concept: Outcome-Conditioned Baseline (OCB) is a compute-friendly way to approximate the right baseline for later blocks by grouping samples with the same intermediate outcome. How it works: 1) Define a simple outcome (e.g., solution correct or incorrect). 2) Within the group, compute the mean reward for the confidence block separately for each outcome. 3) Subtract the matching mean from each sample’s confidence reward to get its advantage. Why it matters: This mimics conditioning on the prefix without extra rollouts, lowering variance and improving credit assignment. šŸž Anchor: For confidence learning, we compare confidence rewards among solutions that were correct together and among those that were incorrect together.

Why it works (intuition):

  • Decoupling: Tokens only get feedback for the objective they control, so signals don’t clash.
  • Fair comparison: Later blocks are compared to others that started from similar states (same outcome), reducing noise.
  • Compute efficiency: It uses the samples you already collected in GRPO; no new sampling is required.

Building blocks:

  • Clear segmentation of the output into blocks.
  • Per-block rewards tied to verifiable objectives.
  • A first-block baseline via group mean; later-block baselines via OCB.
  • Per-block PPO-style updates with block-specific advantages.
  • Optional block-length normalization so short blocks (like confidence) aren’t overwhelmed by long ones (like reasoning).

03Methodology

At a high level: Prompt → Sample a group of completions → Parse into blocks → Compute per-block rewards → Build per-block baselines (group mean for block 1, OCB for later blocks) → Form blockwise advantages → Apply PPO-style updates per block → Updated policy.

Step-by-step (with the Sandwich pattern for key ideas):

  1. Sampling and parsing into blocks 🍞 Imagine you ask several students the same question and collect all their full, tagged answers. 🥬 The Concept: For each prompt, sample a group (e.g., 32) of completions and split each completion into contiguous blocks (e.g., solution block, confidence block) using deterministic tags. How it works: 1) Use the same prompt and decoding settings to get multiple completions. 2) Parse text by tags to get blocks X1, X2, …, XK. 3) Record token ranges for each block. Why it matters: Clear blocks let you map each objective’s reward to the tokens that actually control it. 🍞 Anchor: In math+confidence, X_sol contains the reasoning and final answer; X_conf contains analysis and a numeric confidence.

  2. Per-block rewards 🍞 You know how a quiz can have different rubrics: correctness for the answer and calibration for how sure you were? 🥬 The Concept: Assign a reward r_k to each block that matches its objective (e.g., correctness for solution; a proper scoring rule like Brier or BCE for confidence). How it works: 1) Extract the final answer and check with a verifier to get correctness c∈{0,1}. 2) Parse reported confidence q∈[0,1]. 3) Compute r_sol=c; compute r_conf from (q,c) using Brier or BCE. Why it matters: Rewards are local and verifiable, reducing the need to mix objectives into one fragile scalar. 🍞 Anchor: If the answer is right but overconfident or underconfident, the confidence block’s reward adjusts independently of the solution block’s reward.

  3. Baselines and advantages for block 1 🍞 Think of grading the first section of an exam by comparing to the class average on that same section. 🥬 The Concept: For the first block, all samples share the same starting state (the prompt), so a simple group mean baseline works well. How it works: 1) Compute the mean reward over the group for block 1. 2) Advantage = r_1 − group_mean_1. Why it matters: It reduces variance without mixing in different prefixes (since there are none yet). 🍞 Anchor: For solution blocks across the group, subtract the group’s average correctness-based reward to get stable learning signals.

  4. Conditional baselines for later blocks with OCB 🍞 To judge a second-leg runner fairly, you compare runners who received the baton in the same position (lead vs. trailing). 🥬 The Concept: Outcome-Conditioned Baseline groups later-block samples by a discrete outcome of the prefix (e.g., whether the solution was correct). How it works: 1) Define outcome o from the prefix (correct/incorrect). 2) Split the group into strata by o. 3) Compute a stratum mean for the later block’s rewards. 4) Advantage = r_k − mean_in_matching_stratum. Why it matters: It approximates E[r_k | prefix] using data already in the group, avoiding costly Monte Carlo rollouts and reducing variance. 🍞 Anchor: For confidence, compare only against others with the same correctness outcome.

  5. Forming blockwise advantages and updating only the right tokens 🍞 Picture giving feedback stickers only on the page where that work was done. 🥬 The Concept: Apply each block’s advantage only to its tokens using a PPO-style clipped objective. How it works: 1) Calculate per-token likelihood ratios (new policy vs. old policy). 2) Use clipped PPO loss with the block’s advantage for tokens in that block. 3) Average within each block so short blocks don’t get drowned out by long ones. Why it matters: Prevents cross-objective interference and keeps training balanced. 🍞 Anchor: Confidence tokens get updated by confidence advantages; solution tokens get updated by solution advantages.

  6. Secret sauce (why this recipe is clever)

  • Locality: Each objective teaches the tokens that control it—no more one-score-fits-all.
  • Conditional fairness: OCB compares apples to apples (e.g., correct with correct), keeping later-block baselines aligned with their starting states.
  • Compute thriftiness: It reuses the same group of samples—no extra rollouts, no critic network.

Concrete example with data:

  • Suppose, for one prompt, you sample 32 completions. 20 have correct solutions (c=1), 12 are incorrect (c=0).
  • For the confidence block, compute the average Brier reward separately for the c=1 group and for the c=0 group.
  • For a sample with c=1 and reward r_conf = −(q−1)^2, subtract the c=1 average to get its advantage.
  • Only the confidence tokens in that sample receive this advantage update.
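
The rewards in this example could be computed roughly like this; correctness and Brier/BCE follow the paper's description, but the function names are mine:

```python
import math

def solution_reward(correct):
    """r_sol = c, with c in {0, 1} coming from a verifier."""
    return float(correct)

def confidence_reward(q, correct, rule="brier"):
    """Proper scoring rules as rewards (higher is better)."""
    c = float(correct)
    if rule == "brier":
        return -(q - c) ** 2
    if rule == "bce":
        eps = 1e-6                        # clip q to avoid log(0)
        q = min(max(q, eps), 1 - eps)
        return c * math.log(q) + (1 - c) * math.log(1 - q)
    raise ValueError(rule)

# Right answer reported with q=0.9: small Brier penalty for underconfidence.
print(confidence_reward(0.9, True))
```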

Optional choices and stability notes:

  • No advantage standardization (center but don’t scale) can work well in GRPO-like training.
  • Entropy regularization can be dynamically adjusted to avoid collapsing diversity.
  • KL regularization can be set to zero when PPO clipping and entropy control already keep training stable.

By following these steps, BAE produces cleaner, lower-variance, per-objective learning signals that match how structured text is actually produced.
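
Putting steps 3–5 together, here is a pure-Python sketch of the block-masked, clipped update; all names are hypothetical, and a real implementation would operate on log-probability tensors:

```python
def blockwise_ppo_loss(ratios, block_ids, block_advs, clip=0.2):
    """PPO-style clipped objective where each token uses only its own
    block's advantage, averaged within each block so short blocks
    (like confidence) are not drowned out by long ones."""
    per_block = {b: [] for b in block_advs}
    for ratio, b in zip(ratios, block_ids):
        a = block_advs[b]
        unclipped = ratio * a
        clipped = max(min(ratio, 1 + clip), 1 - clip) * a
        per_block[b].append(min(unclipped, clipped))  # pessimistic bound
    # Mean over tokens within each block, sum over blocks, negate for a loss.
    return -sum(sum(v) / len(v) for v in per_block.values() if v)

ratios    = [1.0, 1.1, 0.9, 1.5]          # new/old per-token probability ratios
block_ids = ["sol", "sol", "sol", "conf"] # block membership of each token
advs      = {"sol": 0.25, "conf": -0.015}
print(blockwise_ppo_loss(ratios, block_ids, advs))
```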

04Experiments & Results

šŸž You know how when you test a thermometer, you check both if it shows the right temperature and if it admits when it’s unsure? 🄬 The Concept: The authors test both solution accuracy and how well the model’s reported confidence matches reality (calibration). How it works: 1) Datasets: MATH500 (in-domain), GSM8K (easy out-of-domain), AIME23–25 (hard OOD). 2) Metrics: Accuracy (right answer), ECE and Brier (calibration), AUROC (how well confidence separates right from wrong). 3) Competing methods: RLCR (single scalar reward with Brier term), Group Mean (unconditioned baseline), Batch Mean, and None. Why it matters: These tests show whether BAE+OCB actually reduces interference and keeps confidence useful in practice. šŸž Anchor: On MATH500 with Qwen models, BAE+OCB often matches RLCR on accuracy and improves calibration.

The scoreboard with context:

  • MATH500 (Qwen2.5-7B-Base): BAE+OCB reaches about 75.1% Pass@1, matching RLCR (~75.0%), and improves ECE (0.032 vs. 0.043). That’s like tying on the main test while being better at honestly reporting uncertainty.
  • MATH500 (Qwen2.5-3B-Instruct): OCB sharply improves calibration (ECE ~0.030) compared to RLCR (~0.059), with competitive accuracy. Think of it as going from a C to an A on the “honesty” scale while keeping a solid exam score.
  • Group Mean can look good on some in-domain metrics (e.g., low ECE) but is brittle under distribution shift: on GSM8K with 7B-Instruct it shows much worse calibration (high ECE and Brier, low AUROC). It’s like a student who does fine on the practice book but struggles on real exams.

Surprising findings:

  • Compute-free conditional baselines (OCB) can match the Monte Carlo truth closely for later-block advantages, without extra rollouts. In a focused study, OCB had the lowest RMSE vs. MC estimates across correctness strata.
  • A naive swap to BCE inside a single mixed reward (like RLCR+BCE) can cause reward hacking—models hiding behind very low confidence or refusing to answer. But training the confidence block with BCE under BAE+OCB avoids this failure, eventually matching Brier-trained results if trained longer.
  • Test-Time Scaling (TTS): Sampling multiple candidates and using confidence-weighted voting still boosts accuracy with BAE+OCB, showing that learned confidence is actionable, not just a pretty metric. Performance tracks closely with RLCR across datasets.
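
Confidence-weighted voting at test time is itself simple; a sketch with illustrative names:

```python
from collections import defaultdict

def confidence_weighted_vote(candidates):
    """Pick the answer whose supporters carry the most total confidence."""
    scores = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Two samples agree with moderate confidence; one dissents confidently.
samples = [("42", 0.6), ("42", 0.55), ("17", 0.9)]
print(confidence_weighted_vote(samples))  # "42" (0.6 + 0.55 > 0.9)
```

Well-calibrated confidences are what make this weighting beat plain majority voting.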

Put simply: OCB is the strongest compute-free baseline for later blocks; BAE+OCB reaches a better accuracy–calibration trade-off than alternatives in-domain and remains competitive out-of-domain, all while avoiding delicate reward mixing.

05Discussion & Limitations

Limitations:

  • Outcome-Conditioned Baselines need enough samples in each outcome stratum (e.g., enough correct and incorrect solutions). If one stratum is tiny, the mean is noisy and calibration can wobble. Smoothing or blending with the global group mean can help.
  • Segments must be clearly defined. If tags are missing or boundaries are fuzzy, signals can leak across blocks or be misapplied.
  • OCB assumes the chosen outcome captures most of what matters about the prefix for the next block’s reward. If finer details matter a lot, a coarse outcome (like just correct/incorrect) may introduce bias.

Required resources:

  • Group-based training with moderate group sizes (e.g., 32–64) and batch sizes that ensure each group’s strata are populated.
  • A verifier or parser to produce per-block, verifiable rewards (e.g., correctness, confidence extraction).
  • Usual PPO/GRPO compute for LLM RL, but no extra rollout cost for baselines.

When not to use:

  • Tasks without stable, parseable segments or where objectives are deeply entangled across all tokens.
  • Settings where the later-block reward depends on very fine-grained prefix details that a simple outcome label cannot capture.
  • Extremely imbalanced regimes where one outcome is vanishingly rare within groups.

Open questions:

  • Learned or multi-bin outcomes: Can we cluster prefixes by richer features (e.g., verifier margins, tool feedback) to get even closer to the true conditional baseline?
  • Beyond verifiable rewards: How well does BAE+OCB work when rewards come from preference models or softer heuristics?
  • Longer horizons and more blocks: How does performance scale with many sequential objectives and very long contexts?
  • Dynamic segmentation: Can the model propose blocks on the fly, with learned boundaries and objectives?

Overall, BAE+OCB is a practical middle path: better credit assignment than one-scalar methods and far cheaper than Monte Carlo rollouts, provided segments and outcomes are well chosen.

06Conclusion & Future Work

Three-sentence summary: This paper introduces Blockwise Advantage Estimation, which gives each text block its own learning signal and applies it only to the tokens that control that block. For later blocks, an Outcome-Conditioned Baseline compares samples that share the same intermediate outcome (like correct vs. incorrect), approximating the right state value without any extra rollouts. On math tasks with solution and confidence, this reduces goal interference, matches or beats strong baselines on calibration, preserves test-time gains, and stays compute-friendly.

Main achievement: A modular, critic-free, GRPO-compatible method that fixes cross-objective credit assignment by routing objective-specific advantages to their corresponding token segments, with a scalable conditional baseline for later blocks.

Future directions: Enrich outcomes beyond binary correctness (e.g., verifier margins, tool signals), learn outcome groupings automatically, and test on long-horizon, multi-step agent tasks and non-verifiable rewards. Explore dynamic segmentation and hybrid baselines that blend outcome-conditioned and global statistics for stability.

Why remember this: It turns multi-objective RL for structured generations from a fragile reward-mixing problem into a clean credit-assignment problem aligned with how text is written—one block at a time—bringing better calibration, competitive accuracy, and lower compute.

Practical Applications

  • Math tutors that solve problems and report honest confidence, helping students know when to double-check.
  • Coding assistants that produce a fix and then verify it, training each step with its own signal.
  • Document agents that draft, review, and finalize sections, optimizing each stage separately.
  • Search-and-verify systems where the model first proposes an answer and then justifies or checks it.
  • Medical triage helpers that give a preliminary assessment and a confidence score, so humans can prioritize cases.
  • Legal or policy drafting assistants that produce an argument and then a risk/confidence analysis.
  • Data pipeline agents that generate a transformation and then validate outputs, with separate rewards per step.
  • Customer support bots that answer and then self-rate reliability, escalating low-confidence cases to humans.
  • Long-horizon planning agents that plan, act, and reflect in stages, each trained with its own objective.
  • Ensembled decision systems that use confidence-weighted voting for better final choices.
#Blockwise Advantage Estimation#Outcome-Conditioned Baseline#Group Relative Policy Optimization#Multi-Objective Reinforcement Learning#Credit Assignment#Calibration#Brier Score#Expected Calibration Error#Confidence-Weighted Ensembling#Structured Generations#Monte Carlo Rollouts#PPO-style Updates#Verifiable Rewards#Test-Time Scaling