Heterogeneous Agent Collaborative Reinforcement Learning
Key Summary
- This paper introduces HACRL, a way for different kinds of AI agents to learn together during training but still work alone during use.
- The main algorithm, HACPO, lets agents share verified practice attempts (rollouts) and learn from one another safely and fairly.
- HACPO fixes two big problems: agents having different skills and their answers coming from different distributions.
- It adds four tools: capability-aware advantage estimation, a capability discrepancy coefficient, exponential importance sampling, and stepwise clipping.
- Across seven tough math benchmarks and many model pairings, HACPO boosts every agent and beats GSPO by an average of 3.3% while using only half the rollout cost.
- HACPO works even when agents are very different in size, training, or architecture (including different tokenizers).
- The paper proves the advantage estimates are unbiased and that learning from others points in the right optimization direction.
- Ablation studies show each component (advantage estimator, capability coefficient, exponential IS, and stepwise clipping) is needed for stability and gains.
- Compared with multi-agent RL, HACRL needs no coordination at deployment; compared with distillation, it enables two-way learning instead of one-way teaching.
Why This Research Matters
HACRL/HACPO shows we can get the best of both worlds: teamwork during training, simplicity during deployment. This saves compute because verified rollouts get reused across agents instead of being thrown away after a single use. It also raises accuracy, because different agents bring different strengths and exploration patterns that complement each other. The method is robust enough to handle real-world heterogeneity, including different model sizes and even different model families. With unbiased advantages and aligned gradients, engineers can trust that shared learning won't push models off track. In practice, this can speed up model improvement cycles and reduce costs in math solving, coding assistants, and other verifiable tasks.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a group of kids solving math puzzles. If each kid works alone, they all spend time trying the same wrong ideas. But if they share what worked (and what didn't), everyone learns faster and wastes less time.
The Concept: Reinforcement Learning (RL) is how an AI learns by trying actions and getting rewards. In modern language-model RL, there's a special form called Reinforcement Learning with Verifiable Rewards (RLVR), where we can automatically check answers (like running a unit test on code or grading a math answer) and give the model clear, trustworthy feedback. How it works, step by step:
- The model gets a question (prompt).
- It generates several answers (rollouts).
- A checker verifies which answers are correct and gives rewards.
- The model updates itself to make good answers more likely next time.
Why it matters: Without verifiable rewards, feedback can be fuzzy. With RLVR, AI can reliably improve at tasks like math and coding.
Anchor: Think of a spelling quiz where a computer immediately tells you if a word is spelled right. That quick, sure feedback helps you learn faster.
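The loop above can be sketched in a few lines of Python. The toy `generate_rollouts` and `verify` functions below are stand-ins for a real model and checker, and the group-mean baseline is one common RLVR choice (GRPO-style), not necessarily this paper's exact estimator:

```python
import random

def generate_rollouts(prompt, num_rollouts=4, seed=0):
    """Stand-in for an LLM sampling several candidate answers."""
    rng = random.Random(seed)
    # Pretend answers to "2 + 3 = ?"; a real model would decode tokens.
    return [rng.choice(["5", "6", "5", "23"]) for _ in range(num_rollouts)]

def verify(answer, gold="5"):
    """Verifiable reward: 1.0 if the final answer matches the key, else 0.0."""
    return 1.0 if answer.strip() == gold else 0.0

def rlvr_step(prompt):
    rollouts = generate_rollouts(prompt)
    rewards = [verify(a) for a in rollouts]
    baseline = sum(rewards) / len(rewards)        # group-mean baseline
    advantages = [r - baseline for r in rewards]  # what the policy update pushes on
    return rollouts, rewards, advantages

rollouts, rewards, advantages = rlvr_step("What is 2 + 3?")
```

Note that the advantages sum to zero within a group: correct answers get pushed up exactly as much as incorrect ones get pushed down.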
Hook: You know how two friends might both practice basketball free throws after school? If they never talk, they both re-discover the same tricks separately. That's slower and more tiring.
The Concept: Today, many large language model (LLM) agents do on-policy RL separately. They each roll out answers, verify them, and only use their own results. That wastes a lot of compute because everyone is solving the same kinds of prompts but not sharing the tries. How it works now, step by step:
- Each agent makes its own rollouts.
- Each agent checks and learns only from its own rollouts.
- Repeat this for every agent separately.
Why it matters: Verification and sampling are expensive. If we don't share, we pay the same cost again and again and miss out on each other's useful mistakes and successes.
Anchor: It's like five kids each buying the same math workbook and doing every problem alone. If they compared notes, they could skip repeated dead ends.
Hook: Picture a mixed team: a chess whiz, a mathlete, a coder, and a writer. Each is good at something different.
The Concept: Heterogeneous agents are AI models that differ in ways like size, training state, or architecture. Some are bigger, some are instruction-tuned, some come from different model families. How it works, step by step:
- We group agents by differences: same family but different states, same family but different sizes, or entirely different architectures.
- Each agent solves the same tasks but starts with different strengths.
- Their rollouts reflect these differences: some find unique right answers, others create informative mistakes.
Why it matters: Diversity means more kinds of ideas and errors to learn from, if we can share them safely.
Anchor: In a math club, the geometry-lover and the algebra-lover spot different solution paths. Together they cover more ground.
Hook: You know how group projects are great during practice, but on test day, you need to work alone?
The Concept: The paper proposes HACRL (Heterogeneous Agent Collaborative Reinforcement Learning), where agents learn collaboratively during training by sharing verified rollouts, but each agent can act independently at inference time. How it works, step by step:
- During training, all agents generate responses and verify them.
- Agents share rollouts and learn from each other with special safety rules.
- At test time, each agent runs by itself; no coordination required.
Why it matters: You get the benefits of teamwork during practice without needing teammates during the real test.
Anchor: It's like scrimmaging together before the game, then each player plays their own position confidently on game day.
Hook: Imagine a recipe swap where chefs share dishes and tasting notes, but they adapt the spice level to their own restaurant's style.
The Concept: Before this paper, people tried multi-agent RL with coordinated execution, or knowledge distillation where a strong teacher trains a weaker student one-way. But those either need coordination at deployment or can't return the favor to the teacher. How they worked and why limits appeared:
- Multi-Agent RL: Agents must coordinate during use; not always practical if you deploy one model.
- Distillation: One-way transfer; the teacher doesn't learn from the student's fresh discoveries.
Why it matters: We need a system where everyone learns from everyone, but deployment stays simple.
Anchor: HACRL says, "Practice together, perform solo." It's collaborative without needing a buddy on stage later.
Hook: Think of a class where every student's quiz attempts are graded and saved. If everyone could reuse everyone else's graded attempts, the class would need fewer new quizzes.
The Concept: The big gap this paper fills is sample efficiency and mutual learning across different agents. By reusing verified rollouts n times in an n-agent system, HACRL cuts down on expensive sampling and lets agents trade ideas. How it works:
- Collect rollouts from all agents for the same prompts.
- Share the verified results.
- Each agent learns from this pool with careful weighting and safety steps.
Why it matters: Saves compute, stabilizes learning, and pushes performance higher for everyone.
Anchor: It's like a shared answer key for practice where you learn from both right answers and common mistakes, faster and cheaper.
02 Core Idea
Hook: You know how two runners can train together, one fast and one steady, and both get better faster than if they ran alone?
The Concept: The key insight in one sentence: Let different agents practice together by sharing verified tries, but make the sharing smart and safe so each agent still learns in the right direction and can run alone later. How it works:
- Pool everyoneās verified rollouts.
- Adjust for whoās stronger or weaker right now (capability-aware baselines and scaling).
- Correct for distribution differences (exponential importance sampling).
- Keep updates stable (stepwise clipping that only downweights cross-agent samples).
Why it matters: Without careful sharing rules, strong agents might get dragged off-distribution, or weak signals might overwhelm training. With the rules, all agents improve together.
Anchor: Like runners swapping training tips but keeping their own pace plans, they both PR on race day.
Multiple analogies for the same idea:
- Study group: Different students compare solved problems; they weigh advice more from the class topper but still learn from clever mistakes classmates made; no one needs a buddy on test day.
- Cooking club: Chefs swap tasting notes; they give more weight to trusted palates yet still value new, interesting experiments; each chef later cooks alone with improved recipes.
- Sports drills: Teammates share play videos; a coach evaluates each player's current skill and decides how much to copy from whom; each player later performs solo but better.
Hook: Imagine what changes if friends who practiced apart now practice together with safety rules.
The Concept: Before vs. After. Before:
- Agents did RLVR alone, repeating the same sampling costs.
- No exchange of complementary knowledge.
- Distillation was one-way; stronger models couldn't learn from weaker ones. After (HACRL/HACPO):
- Every rollout can be reused up to n times in an n-agent team.
- Bidirectional learning: strong helps weak and weak still helps strong (unique paths and mistakes).
- Deployment stays simple: each agent still runs alone.
Why it matters: Better accuracy with less sampling cost and broader problem coverage.
Anchor: It's like swapping practice tests in a class; everyone learns more in the same study time.
Hook: Picture two scales: one adjusts for who's better at the moment, the other adjusts for how different their answer styles are.
The Concept: Why it works (intuition behind the math):
- Capability-aware advantage estimation sets a fair baseline per agent using everyone's rewards but weights them by relative skill. That keeps advantages unbiased and well-calibrated.
- A capability discrepancy coefficient scales gradients from others: upweight signals from stronger peers, downweight noisy signals from weaker ones.
- Exponential importance sampling reduces the effect of big distribution gaps between agents so you don't overfit to someone else's style.
- Stepwise clipping only allows downweighting cross-agent samples and tightens over minibatches, stopping instability from creeping in.
Why it matters: These four pieces act like seat belts, speed limits, and lane markers: safe sharing that still gets you there faster.
Anchor: When combining advice from different coaches, you listen more to the one who's usually right, but you never let outside advice overpower your own training plan.
Hook: Think of building blocks that click together to make a sturdy bridge between agents.
The Concept: Building blocks of HACPO:
- Rollout sharing (the pool of verified tries).
- Agent-Capability-Aware Advantage Estimation (fair baselines).
- Capability Discrepancy Coefficient (gradient scaling by relative skill).
- Exponential Importance Sampling (soft correction for style mismatches).
- Stepwise Clipping (stability guardrails).
Why it matters: Without any one block, the bridge wobbles or collapses; with all blocks, learning is smoother and stronger.
Anchor: It's like LEGO: remove a key brick and the model falls apart; keep them all and the build is solid.
03 Methodology
At a high level: Prompts → Agents generate multiple responses (rollouts) → Rewards are verified → Pool all rollouts → For each agent: compute capability-aware advantages and corrected weights → Apply conservative, stable updates → Improved agent policies.
Now the steps, each with sandwich explanations of the key concepts it relies on:
- Rollout Sharing Hook: You know how teammates share game replays so everyone learns faster without playing ten extra games each? The Concept: Rollout sharing means agents put their verified responses into a shared pool so every agent can learn from more examples with the same sampling cost. How it works:
- For each prompt, each agent produces G responses.
- A verifier scores them (e.g., is the math answer right?).
- All verified results go into one pool.
Why it matters: Each sample can help n agents instead of just one: huge savings in sampling cost.
Anchor: Like one student's worked solution helping the entire study group instead of just themselves.
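As a sketch, the shared pool can be as simple as a flat list of verified samples tagged with their source agent. The `Rollout` record and agent names below are illustrative stand-ins, not the paper's data structures:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt_id: int
    agent: str       # which agent produced the response
    response: str
    reward: float    # verified reward, e.g. 1.0 if the answer checked out

def build_pool(per_agent_rollouts):
    """Merge every agent's verified rollouts into one shared pool."""
    pool = []
    for agent, samples in per_agent_rollouts.items():
        for i, (response, reward) in enumerate(samples):
            pool.append(Rollout(prompt_id=i, agent=agent,
                                response=response, reward=reward))
    return pool

# Two agents, two verified rollouts each: 4 samples that can now
# contribute learning signal to both agents instead of one.
pool = build_pool({
    "agent_a": [("x = 4", 1.0), ("x = 5", 0.0)],
    "agent_b": [("x = 4", 1.0), ("x = 4", 1.0)],
})
```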
- Agent-Capability-Aware Advantage Estimation Hook: Imagine a fair race where starting lines are adjusted so runners with different speeds have equal chances to improve. The Concept: Advantage says how much better a response is than the expected baseline. Here, the baseline mixes everyone's rewards but weights them by each agent's recent performance, creating a capability-aware baseline per agent. How it works:
- Track each agent's smoothed recent performance.
- When computing the baseline for agent k, weigh others' rewards by their capability ratio relative to k.
- Use the joint reward spread for stable scaling.
Why it matters: If you ignore capabilities, the baseline can be biased: too easy for strong agents (over-optimistic) or too hard for weak agents (discouraging). This keeps learning fair and, as proven, unbiased.
Anchor: A strong model doesn't get overconfident from an easy comparison set; a weaker model isn't punished by unfairly high bars.
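A minimal sketch of the idea, with the caveat that the paper's exact weighting may differ: here agent j's rewards enter agent k's baseline with weight c_j / c_k (a capability ratio), and the joint standard deviation scales the result.

```python
import statistics

def capability_aware_advantages(rewards_by_agent, capability, k):
    """Advantages for agent k against a capability-weighted baseline.

    Illustrative form: agent j's rewards enter k's baseline with weight
    capability[j] / capability[k]; the joint reward spread rescales."""
    total_w, total_wr = 0.0, 0.0
    for agent, rewards in rewards_by_agent.items():
        w = capability[agent] / capability[k]
        total_w += w * len(rewards)
        total_wr += w * sum(rewards)
    baseline = total_wr / total_w
    pooled = [r for rs in rewards_by_agent.values() for r in rs]
    spread = statistics.pstdev(pooled) or 1.0  # joint spread for stable scaling
    return [(r - baseline) / spread for r in rewards_by_agent[k]]

# Hypothetical smoothed capabilities and verified rewards for two agents.
rewards = {"weak": [0.0, 1.0], "strong": [1.0, 1.0]}
caps = {"weak": 0.3, "strong": 0.6}
adv_weak = capability_aware_advantages(rewards, caps, "weak")
```

Because the stronger peer's mostly-correct rewards are upweighted in the weak agent's baseline, the bar stays meaningful: the weak agent's wrong answer gets a clearly negative advantage, its correct one a modestly positive advantage.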
- Capability Discrepancy Coefficient (Gradient Modulation) Hook: When asking for advice, you lean more on the person who's often right, and less on someone still learning. The Concept: Scale the learning signal coming from another agent based on relative capability: amplify gradients from stronger peers, attenuate those from weaker ones. How it works:
- Compute a capability ratio between the helper and the learner.
- Multiply the helper's advantage by this ratio when updating the learner.
Why it matters: Prevents noisy updates from weaker agents overwhelming learning while still allowing them to contribute valuable exploration.
Anchor: You follow your chess coach more closely than a beginner friend, but a beginner can still show a fresh tactic now and then.
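In code, the modulation is just a multiplicative coefficient on the cross-agent advantage. Using the bare capability ratio as that coefficient is an illustrative choice; the paper's exact functional form may differ:

```python
def modulate(advantage, cap_source, cap_learner):
    """Scale a cross-agent advantage by the helper/learner capability ratio
    (the bare ratio is an illustrative form of the coefficient)."""
    return (cap_source / cap_learner) * advantage

# A weaker learner amplifies a stronger helper's signal...
amplified = modulate(1.0, cap_source=0.6, cap_learner=0.3)   # coefficient 2.0
# ...while a stronger learner damps a weaker helper's signal.
attenuated = modulate(1.0, cap_source=0.3, cap_learner=0.6)  # coefficient 0.5
```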
- Exponential Importance Sampling Hook: If two friends write in very different handwriting, you copy their notes carefully and maybe a little less literally. The Concept: Importance sampling adjusts for differences between where a sample came from and where your model currently is. HACPO uses a sequence-level ratio, then dampens it with an exponential factor to avoid overreacting to big differences. How it works:
- Compute the ratio of your policy's likelihood to the source agent's likelihood for the whole response (normalized by length).
- Apply a stop-gradient exponential term that reduces the weight when distributions differ a lot.
Why it matters: Prevents getting pulled too hard toward another model's style, keeping updates conservative and stable.
Anchor: You learn from a friend's solution but don't let their very different style overwrite your own approach.
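One way to sketch this behavior; the damping form below is an assumption, not the paper's exact formula. The exponent applied to the raw ratio decays toward zero as the length-normalized log-ratio grows, pulling the weight toward a neutral 1.0:

```python
import math

def damped_is_weight(logp_self, logp_source, length, alpha=1.0):
    """Length-normalized sequence IS ratio with exponential damping.

    Illustrative form: weight = ratio ** exp(-alpha * |log ratio|), so a
    near-identical pair keeps roughly the raw ratio while a large gap is
    pulled toward a neutral 1.0. In training, the damping factor would sit
    inside a stop-gradient so only the raw ratio carries gradient."""
    log_ratio = (logp_self - logp_source) / length  # per-token log-ratio
    damping = math.exp(-alpha * abs(log_ratio))     # shrinks as the gap grows
    return math.exp(log_ratio * damping)

# Similar distributions: weight stays close to the raw ratio (near 1.0).
close = damped_is_weight(logp_self=-10.0, logp_source=-10.5, length=50)
# Very different distributions: weight is pulled well below the raw ratio.
far = damped_is_weight(logp_self=-10.0, logp_source=-60.0, length=50)
```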
- Stepwise Clipping Hook: Trim a hedge little by little so you don't accidentally chop off a big branch. The Concept: Clipping limits how much cross-agent samples can influence an update. HACPO uses an asymmetric clip that never lets cross-agent samples upweight beyond self-generated samples and tightens the clip across minibatches within a step. How it works:
- Cross-agent importance weights are clipped to [1 − δ, 1.0], never above 1.0.
- Across minibatches, the lower bound moves up (tightens), recognizing policy drift as updates accumulate.
Why it matters: Stops late minibatches from being hijacked by off-distribution samples and keeps training steady.
Anchor: Even if an outside sample looks tempting, it can't overpower what your own policy already prefers.
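A sketch of the clip; the linear tightening schedule is an assumption, since the text only requires that the lower bound rise across minibatches while the upper bound stays pinned at 1.0:

```python
def stepwise_clip(weight, minibatch_idx, num_minibatches, delta=0.2):
    """Asymmetric, tightening clip for cross-agent importance weights.

    The upper bound is fixed at 1.0, so cross-agent samples can only be
    downweighted, never upweighted past self-generated ones; the lower
    bound rises from 1 - delta toward 1.0 as minibatches accumulate
    within a step (the linear schedule here is illustrative)."""
    progress = minibatch_idx / max(num_minibatches - 1, 1)
    lower = 1.0 - delta * (1.0 - progress)
    return min(1.0, max(lower, weight))

first = stepwise_clip(0.7, minibatch_idx=0, num_minibatches=4)   # loose early
last = stepwise_clip(0.7, minibatch_idx=3, num_minibatches=4)    # tight late
capped = stepwise_clip(1.5, minibatch_idx=0, num_minibatches=4)  # never > 1.0
```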
- Independent Execution with Collaborative Optimization Hook: Practice as a team, perform as a soloist. The Concept: Train with shared rollouts and cross-agent signals, but at inference time each agent runs alone without coordination. How it works:
- All safety mechanisms are used only during training updates.
- The learned policy is standalone for deployment.
Why it matters: You get shared learning gains without making deployment complex.
Anchor: You scrimmage together; on game day, each player plays their role independently.
Example with actual data flow:
- Input: A math word problem from the MATH dataset.
- Agents: Qwen3-1.7B-Base and Qwen3-4B-Base both generate 8 candidate solutions.
- Verifier: Checks each final numeric answer and intermediate steps when possible.
- Pool: 16 verified responses (some right, some wrong) go into the shared pool.
- For Qwen3-1.7B: Compute capability-aware baseline using both agents' rewards with performance-weighted averaging; apply exponential IS to Qwen3-4B's samples; clip cross-agent weights; scale gradients by the capability ratio; update.
- For Qwen3-4B: Same process, but now learning from 1.7B's distinct exploration, errors, and occasional unique correct paths.
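Putting the pieces together, the per-sample signal a learner derives from a peer's rollout chains the four mechanisms described above. Every functional form and number below is illustrative, not taken from the paper:

```python
import math

def cross_agent_signal(advantage, cap_source, cap_learner,
                       logp_learner, logp_source, length,
                       minibatch_idx, num_minibatches,
                       alpha=1.0, delta=0.2):
    """Modulated learning signal from one peer rollout (illustrative forms)."""
    # 1) Capability discrepancy coefficient: trust stronger peers more.
    coeff = cap_source / cap_learner
    # 2) Length-normalized sequence IS ratio with exponential damping.
    log_ratio = (logp_learner - logp_source) / length
    weight = math.exp(log_ratio * math.exp(-alpha * abs(log_ratio)))
    # 3) Stepwise asymmetric clip: never above 1.0; lower bound tightens.
    progress = minibatch_idx / max(num_minibatches - 1, 1)
    weight = min(1.0, max(1.0 - delta * (1.0 - progress), weight))
    # 4) The clipped, trust-scaled signal that enters the policy update.
    return coeff * weight * advantage

# Hypothetical numbers for the 1.7B learner consuming a 4B peer's rollout.
signal = cross_agent_signal(advantage=0.6, cap_source=0.7, cap_learner=0.4,
                            logp_learner=-120.0, logp_source=-100.0, length=200,
                            minibatch_idx=0, num_minibatches=4)
```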
The secret sauce:
- The mix of fairness (capability-aware baselines), trust calibration (capability scaling), conservativeness (exponential IS), and stability (asymmetric, stepwise clipping) makes cross-agent sharing helpful instead of harmful. The theory shows advantages stay unbiased and gradients stay aligned with the right direction, while experiments confirm consistent gains and lower rollout costs.
04 Experiments & Results
Hook: Think of a science fair where teams try different experiments but share lab notes. If everyone reads everyone else's notes carefully, results improve without buying twice the materials.
The Concept: The authors trained on 7.5k math problems from MATH and tested on seven benchmarks: MATH-500, MATH, GSM8K, AIME2025, AMC23, Minerva, and Olympiad. They compared HACPO to GRPO, GSPO, a resource-equivalent GSPO×2 baseline (double rollouts and updates), and a Naive rollout-sharing baseline without HACPO's careful controls. How it works:
- Agents were paired across three heterogeneity types: state (e.g., Qwen3-4B vs Qwen3-4B-Instruct), size (e.g., Qwen3-1.7B-Base vs Qwen3-4B-Base), and model (e.g., Qwen3-4B-Base vs Llama3.2-3B-Instruct).
- All methods used the same verifiable reward setting; HACPO reused each batch's verified rollouts across agents with its safety mechanisms.
Why it matters: This setup tests whether collaboration helps in easy and hard settings, similar and very different models, and against strong baselines.
Anchor: It's like checking if a study group helps not just within one class section but across grades and even different schools.
The competition and scoreboard:
- Against GSPO, HACPO improved average accuracy by about 3.3% on tough math tasks (like turning a solid B into an A−) and did so with only half the rollout cost.
- In heterogeneous state (Qwen3-4B vs Qwen3-4B-Instruct): Both gained, with the Instruct model also improving; collaboration was not purely one-way.
- In heterogeneous size (Qwen3-1.7B-Base vs Qwen3-4B-Base): Both improved strongly. The smaller model added diverse exploration (useful mistakes and a few unique correct solutions), while the larger model provided strong guidance.
- In heterogeneous model (Qwen3-4B-Base vs Llama3.2-3B-Instruct): Despite different tokenizers and architectures, both models increased performance, showing the method works across families.
- The Naive sharing baseline underperformed, proving HACPO's safety mechanisms are necessary.
- The GSPO×2 resource-equivalent setting (more data and updates) still didn't match HACPO consistently, indicating the value of bidirectional, capability-aware transfer rather than just more data.
Surprising findings and insights:
- Even stronger agents learned from weaker ones due to complementary exploration: fresh wrong turns that reveal unseen traps and occasional unique right paths.
- Sequence-level importance sampling with exponential damping and asymmetric clipping was crucial; without it, cross-agent samples could destabilize late minibatches.
- The theory's promises held up: unbiased advantages and gradient alignment translated into stable, consistent gains.
Ablations that make numbers meaningful:
- Remove capability-aware advantage estimation: performance drops, confirming standard group-relative baselines become biased in heterogeneous settings.
- Remove capability scaling in gradient modulation: performance drops, showing the importance of listening more to stronger peers without muting weaker peers entirely.
- Vary exponential IS strength (alpha): more damping = more stability but less aggressive learning; the best alpha depends on the model pair.
- Remove stepwise clipping or the clip entirely: training gets unstable or converges worse; stepwise asymmetry clearly stabilizes cross-agent learning.
Context on scores:
- Think of 0.87 accuracy vs. 0.84: that's the difference between missing roughly 13 out of 100 questions and 16 out of 100, noticeable on hard sets.
- HACPO's consistent gains across MATH, GSM8K, AIME2025, AMC23, Minerva, and Olympiad show it's not a one-benchmark trick but a robust training recipe.
Bottom line:
- HACPO makes collaboration reliably helpful instead of risky.
- It achieves better accuracy with fewer rollouts, which saves time and compute.
- The improvements show up across model sizes, states, and even model families.
05 Discussion & Limitations
Hook: Even the best study plan has trade-offs, like balancing practice variety with not getting confused by too many styles.
The Concept: Limitations and honest assessment.
- What it can't do: If a collaborator is very off-distribution or consistently low-quality, even with damped weights and clipping, their samples may add little value. HACPO is designed to reduce harm but can't turn noise into gold.
- Required resources: You need verifiable reward pipelines (e.g., graders, unit tests), multiple agents available during training, and infrastructure to pool, weight, and clip cross-agent samples.
- When not to use: If you only have one agent, or if verification is unavailable or too expensive, the sharing advantage shrinks. If deployment requires real-time multi-agent coordination, standard MARL may be a better fit.
- Hyperparameter sensitivity: The benefits depend on carefully setting alpha for exponential IS, clipping ranges, and smoothing windows for capability estimates; different agent pairs may need tuning.
- Open questions: How does HACPO scale to many agents beyond pairs? Can automated schedules tune alpha and clipping per-pair on the fly? What happens in domains where verification is partial or noisy rather than crisp and binary?
Why it matters: Knowing boundaries helps you deploy HACPO where it shines: tasks with verifiable rewards, multiple available agents, and a desire for independent deployment at test time.
Anchor: If your school has good graders and several study partners, HACPO is like the perfect shared practice plan; if not, its benefits fade.
06 Conclusion & Future Work
Hook: Think of a team that practices together but performs solo, and each member still gets better than if they had practiced alone.
The Concept: Three-sentence summary. HACRL introduces collaborative training for heterogeneous agents that still execute independently at inference time. HACPO turns this idea into a practical algorithm by sharing verified rollouts and adding four guardrails: capability-aware advantages, capability scaling, exponential importance sampling, and stepwise clipping. Together, they deliver unbiased estimates, aligned gradients, strong stability, and consistent accuracy gains with fewer rollouts.
Why it matters: It upgrades sample efficiency and pushes models past their self-learning ceilings by tapping into complementary exploration and mutual guidance.
Anchor: The main achievement is showing that smart, safe sharing makes everyone better, even across different sizes and model families.
Future directions:
- Automating hyperparameter schedules (e.g., alpha and clipping) and per-pair adaptivity.
- Extending to more than two agents and dynamically forming collaboration groups.
- Applying beyond math and code to RLVR-friendly domains like data cleaning, theorem proving, or configuration optimization.
Why remember this: HACRL/HACPO reframes post-training for LLMs (practice together, perform alone), unlocking higher accuracy at lower rollout cost, and proving that heterogeneous peers can learn from each other without tripping over differences.
Practical Applications
- Training math-reasoning models: Share verified solutions and errors across differently tuned models to boost accuracy faster.
- Code generation with unit tests: Reuse passing and failing test runs across agents to cut compute and improve robustness.
- Automated theorem proving: Share proof attempts (successful and dead-ends) among diverse provers to explore more strategies safely.
- Data cleaning and validation: Pool verified cleaning scripts or rules across agents to converge on better pipelines.
- Configuration and optimization tasks: Share evaluated configurations across models to accelerate search.
- Educational AI tutors: Train diverse tutor agents collaboratively so each improves, then deploy the best single agent per classroom.
- Multi-vendor model improvement: Let models from different families safely learn from each other's verified outputs without joint deployment.
- Low-resource fine-tuning: Use HACPO to achieve higher gains with fewer rollouts when compute budgets are tight.
- Safety auditing with verifiable checks: Share verified refusals and corrections to improve consistency across agents.
- Enterprise model updates: Run periodic collaborative training sprints across internal models to reduce training time and cost.