
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Intermediate
Xiaoxuan Wang, Han Zhang, Haixin Wang et al. Ā· 2/25/2026
arXiv

Key Summary

  • This paper tackles why training AI agents that act over many steps (like browsing the web or moving in a house) often becomes unstable and collapses.
  • It builds a clean practice arena called ARLArena so everyone can test ideas in the same fair way and see what truly helps stability.
  • The authors split the training recipe into four dials: how we add up mistakes (loss aggregation), how we limit risky updates (importance-sampling clipping), how we pick useful practice runs (dynamic filtering), and how we score actions fairly (advantage design).
  • They discover that tolerant (too-forgiving) clipping causes sudden crashes, while sequence-level clipping keeps training calm and steady.
  • Designing finer advantages that use environment clues helps the agent learn where and when it made good choices.
  • Filtering practice runs dynamically helps most when the advantage signals are rich and diverse.
  • Putting these insights together, they create SAMPO, a stable training method that greatly improves success rates and avoids collapse.
  • SAMPO reaches 92.72% success on ALFWorld and beats popular baselines by about 25% on average, with smooth, monotonic learning curves.
  • They also show a simple recipe—behavior cloning start, strict output format, and KL regularization—is crucial to avoid noisy mistakes early.
  • The work offers a reproducible path to scalable, stable agent training that can power real tools like assistants, web agents, and embodied robots.

Why This Research Matters

Stable agent training means assistants can complete multi-step tasks reliably—shopping for items that fit constraints, scheduling across apps, or investigating a research question with tools. Robots and embodied agents can follow longer plans in homes, hospitals, and factories without spiraling into errors. Web and API agents can use tools safely and consistently, making fewer invalid calls and wasting less compute. Math and research agents can plan multi-step reasoning with code execution while avoiding brittle failures. Because the recipe is standardized and reproducible, teams can compare methods fairly and scale up training with confidence. This lowers costs, boosts reliability, and speeds up real-world deployment of helpful AI systems.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine training a soccer robot. In drills, it does great passes. But in a real game with crowds, noise, and pressure, tiny mistakes at the start can snowball into losing the ball again and again. That’s what happens to many AI agents today.

🄬 The Concept (Reinforcement Learning):

  • What it is: Reinforcement Learning (RL) teaches an agent by rewarding good actions and discouraging bad ones as it practices in an environment.
  • How it works:
    1. The agent tries an action.
    2. The environment reacts and gives feedback (a reward).
    3. The agent adjusts its behavior to earn more reward next time.
  • Why it matters: Without RL, agents can’t learn from long chains of decisions where early moves affect later success, like planning a 10-step task. šŸž Anchor: A cleaning robot learns not just to ā€˜vacuum now’ but to ā€˜move the chair first, then vacuum’, because the reward is higher when the floor ends up truly clean.

šŸž Hook: You know how a GPS can guide you on a road trip, but if it updates too slowly, you might take the wrong turn and then every next step goes off course?

🄬 The Concept (Policy Gradient):

  • What it is: Policy gradient is a family of RL methods that directly nudge the agent’s ā€œdecision ruleā€ toward actions that led to higher rewards.
  • How it works:
    1. Collect examples of what the agent did and how well it scored.
    2. Increase the chance of good actions; decrease the chance of unhelpful ones.
    3. Repeat many times until the policy gets strong.
  • Why it matters: Without direct nudging, the agent might wander aimlessly and learn too slowly. šŸž Anchor: If a student gets As when they study with flashcards, their ā€œpolicyā€ becomes ā€œuse flashcards moreā€; if cramming fails, they do less of it.
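
The nudging loop above can be sketched in a few lines. This is a toy REINFORCE-style update on a hypothetical two-armed bandit, not the paper's agent setup; the learning rate and reward scheme are illustrative.

```python
import math
import random

# Toy setup (an assumption for illustration): action 1 always pays reward
# 1.0, action 0 pays 0.0. The whole "policy" is a single logit for action 1.

def p_action1(logit):
    """Probability of choosing action 1 under the current policy."""
    return 1.0 / (1.0 + math.exp(-logit))

def reinforce_step(logit, action, reward, lr=0.5):
    """Nudge the logit toward rewarded actions.
    d/dlogit log pi(action) = action - p_action1(logit)."""
    return logit + lr * reward * (action - p_action1(logit))

random.seed(0)
logit = 0.0
for _ in range(200):
    act = 1 if random.random() < p_action1(logit) else 0
    logit = reinforce_step(logit, act, reward=float(act))

# After training, the policy strongly prefers the rewarded action.
```

Step 1 of the loop is the rollout (sample an action), step 2 is the score (the reward), and step 3 is the nudge (`reinforce_step`), repeated many times.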

šŸž Hook: Think of a kid who not only answers a question but also clicks buttons on a computer, opens web pages, and uses tools to finish a project over many steps.

🄬 The Concept (Agentic Reinforcement Learning):

  • What it is: Agentic RL (ARL) trains language-model agents to act over multiple turns—planning, using tools, observing results, and continuing.
  • How it works:
    1. The agent reads the current situation.
    2. It thinks (plan), acts (tool call or command), and sees what happens.
    3. It repeats across many turns until the goal is done.
  • Why it matters: Without ARL, models are good at single answers but weak at multi-step problem solving in changing worlds. šŸž Anchor: A shopping agent that searches, filters, clicks products, and finally buys the right item must handle many decisions, not just one.

The world before this paper looked exciting but shaky. LLM agents could do web shopping, household tasks in a simulator (ALFWorld), math with tools, or puzzle games like Sokoban. But training often crashed. Why? Because small early mistakes—like producing a badly formatted action, clicking an invalid button, or taking a wrong step—led to huge shifts later. Rewards were sparse and delayed, so it was hard to know which step truly helped. And every update changed the agent a bit, making yesterday’s practice runs feel outdated for today’s updates.

Researchers tried different fixes: softer clipping (to allow bigger updates), different ways to add up losses, or filtering the data. These helped sometimes but also caused sudden collapses: gradients exploded, formatting broke, and success rates tanked.

šŸž Hook: You know how fair science experiments use the same lab setup so results are comparable?

🄬 The Concept (ARLArena):

  • What it is: ARLArena is a standardized training arena and analysis toolkit to test what really stabilizes ARL.
  • How it works:
    1. Start the agent from good habits using behavior cloning (copying strong examples).
    2. Enforce strict output format (like requiring <think> and <action> tags) so early errors don’t snowball.
    3. Add a KL regularizer to keep the agent close to a reference model and avoid wild jumps.
    4. Systematically test four training dials: loss aggregation, clipping, dynamic filtering, and advantage design.
  • Why it matters: Without a clean, fair setup, it’s impossible to know which changes truly fix instability. šŸž Anchor: Like a sports league with standard-sized fields and rules so teams can be fairly compared.

Using ARLArena, the authors discovered the main culprits behind collapse and the remedies that truly help. Most crucially, they found that tolerant (too-forgiving) clipping gives quick early wins but later crashes. In contrast, sequence-level clipping—judging the whole action sequence together—keeps training safe and steady. They also found better advantage design (scoring that uses environment structure) helps learning, and dynamic filtering (smartly resampling practice examples) works best when advantages are informative.

šŸž Hook: Imagine combining the best parts of several good recipes into one reliable cake you can bake every time.

🄬 The Concept (SAMPO):

  • What it is: SAMPO is a new training method that unifies the winning ideas—sequence-level clipping, fine-grained advantages, and dynamic filtering—on top of the clean ARLArena setup.
  • How it works:
    1. Begin with a stable starting policy (behavior cloning) plus format rules and KL regularization.
    2. Compute advantages that mix global outcome and step-level context.
    3. Clip at the sequence level to prevent any one token from wrecking the whole update.
    4. Dynamically filter uninformative groups to focus learning on useful signals.
  • Why it matters: Without combining all these pieces, training remains fragile. With them, learning becomes smooth and scalable. šŸž Anchor: SAMPO turns shaky mountain biking into a guided, well-marked trail where you steadily improve and don’t crash.

Real stakes: Stable ARL means assistants that can reliably browse the web to complete tasks, robots that can tidy rooms without spiraling into errors, and math/tool agents that can plan and adapt over many steps. It means fewer reruns, less compute wasted, and systems you can trust to scale to bigger, longer, more realistic tasks.

02Core Idea

šŸž Hook: Picture four volume knobs on a stereo system. If any one goes too high or too low, the music sounds bad. Balance them, and everything becomes clear and enjoyable.

🄬 The Aha! Moment: The key insight is that ARL stability comes from balancing four training dials—loss aggregation, importance-sampling (IS) clipping, dynamic filtering, and advantage design—and then combining the winning settings into one unified method, SAMPO, inside a clean, standardized arena.

Three analogies:

  1. Cooking: Seasoning (advantage design), oven temperature (clipping), which bites to taste (dynamic filtering), and how you average many taste-tests (loss aggregation) all matter; the perfect dish needs all four balanced.
  2. Sports practice: How you score drills (advantage), how hard you push athletes (clipping), which drills you repeat (filtering), and how you grade overall (aggregation) shape steady improvement.
  3. Traffic control: Speed limits (clipping), which lanes to prioritize (filtering), how to rank route choices (advantage), and how to summarize traffic across the city (aggregation) keep traffic flowing smoothly.

Before vs. after:

  • Before: Many methods tweaked one dial, got short-term gains, then crashed—especially with tolerant (too-gentle) clipping.
  • After: A clean setup (behavior cloning + strict format + KL regularization), sequence-level clipping, richer advantages, and smart filtering together create stable, monotonic learning across tasks.

Why it works (intuition):

  • Clipping at the sequence level treats the whole decision (a multi-token action) as one unit, aligning with how rewards are given. This prevents a few wild tokens from hijacking the update.
  • Fine-grained advantage design adds context from the environment at both the big-picture (episode) and step levels, making credit assignment less noisy and more precise.
  • Dynamic filtering reduces time wasted on all-correct or all-wrong groups that teach little, but it only shines when advantages are informative enough to keep format learning on track.
  • A clean start (behavior cloning), formatting rules, and KL regularization keep early training from drifting into useless or invalid behavior, lowering noise and preventing collapse.

Building blocks (each with a mini Sandwich):

šŸž Hook: You know how report cards can average grades by class or by total points? That choice affects how students look. 🄬 The Concept (Loss Aggregation):

  • What it is: Loss aggregation is how we combine many token-level mistakes into one training signal.
  • How it works: Either average per token (token-mean) or average per sequence first then tokens (seq-mean-token-mean). Different schemes change how much short vs. long responses count.
  • Why it matters: Unbalanced weighting can bias learning toward short or long outputs, sometimes hurting tasks like math with very long solutions. šŸž Anchor: If you reward each test equally, short quizzes can dominate; if you add up all points, long exams weigh more. The choice changes study habits.
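
The two schemes named above differ only in where the averaging happens. A minimal sketch with made-up per-token losses:

```python
# Each inner list holds per-token losses for one sequence (toy numbers).
losses = [
    [1.0, 1.0],                        # short sequence, 2 tokens
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],    # long sequence, 6 tokens
]

def token_mean(batch):
    """Pool all tokens together, then average: long sequences dominate."""
    flat = [t for seq in batch for t in seq]
    return sum(flat) / len(flat)

def seq_mean_token_mean(batch):
    """Average within each sequence first, then across sequences:
    every sequence counts equally regardless of its length."""
    per_seq = [sum(seq) / len(seq) for seq in batch]
    return sum(per_seq) / len(per_seq)

# token_mean(losses) -> 0.25, but seq_mean_token_mean(losses) -> 0.5:
# the same batch, weighted two different ways.
```

The gap between 0.25 and 0.5 on identical data is exactly the length bias the text describes.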

šŸž Hook: Imagine a parent letting you try new tricks on a skateboard but setting a speed limit for safety. 🄬 The Concept (Importance-Sampling Clipping):

  • What it is: A safety brake that limits how much the new policy can differ from the old one during updates.
  • How it works: Tolerant clipping is too forgiving and can allow risky jumps; sequence-level clipping judges the whole action sequence together, keeping changes stable.
  • Why it matters: Without good clipping, early quick wins become late-stage crashes—formatting breaks, gradients explode, success plummets. šŸž Anchor: With a sensible speed limit for the entire ride (sequence-level), you get home safely and steadily improve.
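
The difference between a pessimistic clip and a tolerant one shows up on a single token. A sketch, assuming a generic PPO-style term and a loose one-sided cap standing in for the tolerant variants; both thresholds are illustrative:

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def strict_token_term(ratio, adv, eps=0.2):
    """PPO-style pessimistic clip: take the worse of the raw and clipped
    objectives, so large risky jumps never look attractive."""
    return min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def tolerant_token_term(ratio, adv, hi=5.0):
    """A looser one-sided cap (illustrative of 'tolerant' clipping):
    large ratios still contribute, so risky updates slip through."""
    return min(ratio, hi) * adv

# Same off-policy token (ratio 3.0, positive advantage):
# strict keeps the update at 1.2; tolerant lets it grow to 3.0.
```

That 2.5x larger contribution from a single token is the kind of risky jump the text blames for late-stage crashes.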

šŸž Hook: A coach repeats drills that are neither too easy nor impossible, because that’s where you learn fastest. 🄬 The Concept (Dynamic Filtering):

  • What it is: Picking which trajectory groups to keep or resample so the batch has more informative learning signals.
  • How it works: Filter out groups where every attempt is identical (all right or all wrong) and add ones that provide contrast.
  • Why it matters: Without it, you waste practice on uninformative data; with it (paired with good advantages), you learn faster and more stably. šŸž Anchor: Instead of 10 identical free throws, you mix in varied shots that teach new skills.
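
In the simplest case, the filtering rule reduces to dropping groups with zero reward spread. A sketch, assuming each group is just a list of scalar rewards:

```python
def dynamic_filter(groups):
    """Keep only rollout groups whose rewards differ within the group;
    all-correct or all-wrong groups carry no learning contrast."""
    return [g for g in groups if max(g) != min(g)]

groups = [
    [1.0, 1.0, 1.0, 1.0],   # all right: filtered out
    [0.0, 0.0, 0.0, 0.0],   # all wrong: filtered out
    [1.0, 0.0, 1.0, 0.0],   # mixed: kept, informative contrast
]
kept = dynamic_filter(groups)   # only the mixed group survives
```

In practice the dropped groups would be resampled, as the text notes, rather than simply shrinking the batch.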

šŸž Hook: When grading a science fair, you don’t just mark the final trophy—you compare each project to peers at that table. 🄬 The Concept (Advantage Design):

  • What it is: A fair score for how good an action or sequence was compared to others, sometimes using extra environment info.
  • How it works: Combine episode-level success with step-level groups sharing the same state to assign more precise credit.
  • Why it matters: Without nuanced advantages, the agent struggles to know which steps truly helped in long tasks. šŸž Anchor: In ALFWorld, comparing different actions taken from the same room state clarifies which choice moved the task forward.
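
A minimal sketch of group-relative scoring plus an episode/step blend; the normalization and the mixing weight are illustrative choices, not the paper's exact formula:

```python
import statistics

def group_advantages(rewards):
    """Score each rollout against the mean (and spread) of its group,
    GRPO-style: above-average rollouts get positive advantage."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0   # avoid dividing by zero
    return [(r - mu) / sd for r in rewards]

def mixed_advantage(episode_adv, step_adv, w=0.5):
    """Blend the big-picture episode signal with the per-state step
    comparison; the weight w is a hypothetical knob."""
    return w * episode_adv + (1.0 - w) * step_adv
```

For a group with rewards [1.0, 0.0], the winner scores +1 and the loser -1; blending those with step-level comparisons at shared states is what gives the finer credit assignment described above.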

Putting it together (SAMPO in brief):

  • Start with a clean slate (behavior cloning, format penalty, KL regularization).
  • Use sequence-level clipping for calm, consistent updates.
  • Design advantages that mix overall outcome and per-step comparisons.
  • Apply dynamic filtering to focus on informative training groups.

The result: Fewer collapses, smoother curves, stronger final performance across very different tasks, showing that the balanced, unified recipe—not a single trick—is what brings stability to agentic RL.

03Methodology

High-level pipeline: Inputs (prompts, tools, environments) → Stable Testbed Setup → Four-Dial Analysis → Unified Method (SAMPO) → Outputs (stable policy with high success).

Step 0: Build a stable testbed

šŸž Hook: Before cooking a fancy meal, you clean the kitchen, preheat the oven, and lay out ingredients so nothing goes wrong mid-recipe.

🄬 The Concept (Behavior Cloning Initialization):

  • What it is: Start by copying high-quality example interactions so the agent already behaves reasonably.
  • How it works:
    1. Collect self-generated but filtered good rollouts (keep only high-scoring ones).
    2. Train the model to imitate them (SFT).
    3. Begin RL from this safer starting point.
  • Why it matters: Without a good start, early RL can drown in invalid actions and noisy rewards. šŸž Anchor: Like tracing a great drawing before freehand sketching—your first attempts are already close to right.
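
The data-selection step above (keep only high-scoring rollouts for imitation) is a one-line filter. A sketch, assuming rollouts arrive as (trajectory, score) pairs with an illustrative threshold:

```python
def select_bc_data(rollouts, threshold=0.9):
    """Keep only high-scoring self-generated rollouts for the supervised
    warm-start (SFT); the threshold value is a hypothetical choice."""
    return [traj for traj, score in rollouts if score >= threshold]

demo = [("good trajectory", 1.0), ("failed trajectory", 0.2)]
kept = select_bc_data(demo)   # only the high-scoring rollout remains
```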

šŸž Hook: You’ve filled out forms online: if you don’t follow the boxes and labels, the website rejects it.

🄬 The Concept (Format Penalty):

  • What it is: A rule that the agent must wrap thoughts in <think> and actions in <action> tags; breaking the rule gets a penalty.
  • How it works:
    1. Check every output for proper tags and structure.
    2. Apply a fixed negative score if it’s malformed.
    3. Provide dense signals early so the agent quickly learns valid formatting.
  • Why it matters: Without this, invalid outputs poison training and cause collapse. šŸž Anchor: A submission portal that won’t accept missing fields forces you to get the format right fast.
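
The check-and-penalize loop can be sketched with a regular expression. The exact tag schema and the penalty value of -1.0 are assumptions based on the description above:

```python
import re

# Require a <think> block followed by an <action> block (an assumed schema).
PATTERN = re.compile(r"^<think>.*?</think>\s*<action>.*?</action>$", re.DOTALL)

def format_reward(output, penalty=-1.0):
    """Return 0 for a well-formed output, a fixed penalty otherwise.
    This dense early signal is what teaches valid formatting fast."""
    return 0.0 if PATTERN.match(output.strip()) else penalty

good = "<think>find the apple</think><action>go to fridge 1</action>"
bad = "go to fridge 1"   # missing tags: gets the penalty
```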

šŸž Hook: Imagine a kite tied to a string so it doesn’t fly off in a gust.

🄬 The Concept (KL Regularization):

  • What it is: A gentle pull that keeps the trained policy close to a reference model, preventing wild jumps.
  • How it works:
    1. Measure how far the current policy drifts from a reference.
    2. Add a penalty if it drifts too far.
    3. Balance exploration with safety.
  • Why it matters: Without KL, early aggressive updates can forget useful knowledge and break behavior. šŸž Anchor: The string keeps the kite soaring but not lost—steady and controlled.
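
The drift measurement and penalty can be sketched for a discrete action distribution; the KL direction and the beta weight are illustrative choices, not the paper's exact configuration:

```python
import math

def kl_penalty(p_new, p_ref):
    """KL(new || ref) over a discrete distribution: how far the trained
    policy has drifted from the reference model."""
    return sum(p * math.log(p / q) for p, q in zip(p_new, p_ref) if p > 0)

def regularized_reward(reward, p_new, p_ref, beta=0.1):
    """Subtract a KL term from the task reward; beta (the 'string length'
    on the kite) is a hypothetical weight."""
    return reward - beta * kl_penalty(p_new, p_ref)

# No drift -> no penalty; drifting away from the reference costs reward.
```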

Step 1: Decompose policy gradient into four dials

  • Loss Aggregation (how to average mistakes)
  • IS Clipping (how to limit risky updates)
  • Dynamic Filtering (which trajectories to keep/resample)
  • Advantage Design (how to score actions fairly)

At this stage, the paper runs controlled experiments, flipping one dial at a time to see its impact on stability and performance.

Step 2: Findings that shape the unified method

  • Tolerant clipping (too forgiving at the token level) gives quick early wins but later collapses—gradients spike, formatting breaks, success falls.
  • Sequence-level clipping keeps training stable and steadily improving.
  • Advantage design that uses environment structure (episode + step groups) improves credit assignment.
  • Dynamic filtering helps most when paired with richer advantages; otherwise it can accidentally remove useful ā€œformat-fixingā€ signals.
  • Sequence masking (mask negative-advantage sequences with low ratios) stabilizes tolerant methods by muting harmful updates.

šŸž Hook: You don’t judge a book by a single word—you read the chapter.

🄬 The Concept (Sequence-level Clipping vs. Tolerant Clipping):

  • What it is: Sequence-level clipping measures change over the whole action sequence; tolerant clipping focuses on individual tokens and is lenient.
  • How it works:
    1. Compute a sequence-wide change score.
    2. Limit updates if the whole sequence deviates too much.
    3. Prevent a few extreme tokens from dominating.
  • Why it matters: Without sequence-level control, token outliers can derail long-horizon learning. šŸž Anchor: Rating the entire essay, not one sentence, leads to fairer, steadier grading.
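
The three steps above amount to computing one importance ratio for the whole sequence and clipping once. A sketch (some sequence-level methods also length-normalize the ratio; that detail is omitted here):

```python
import math

def sequence_ratio(logp_new, logp_old):
    """One importance ratio for the whole action sequence: sum the
    per-token log-probs, then exponentiate the difference."""
    return math.exp(sum(logp_new) - sum(logp_old))

def sequence_clipped_term(logp_new, logp_old, adv, eps=0.2):
    """Clip once, at the sequence level, so one extreme token cannot
    dominate the update on its own."""
    r = sequence_ratio(logp_new, logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * adv, clipped * adv)

# One wild token (log-ratio +2.0) inflates the sequence ratio to ~7.4,
# but the clipped update contribution stays at 1.2.
```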

šŸž Hook: When a few rotten apples spoil the bunch, you remove them before cooking.

🄬 The Concept (Sequence Masking):

  • What it is: Automatically ignore updates from harmful sequences (negative advantage and very off-policy) during training.
  • How it works:
    1. Detect sequences that both underperformed and are far from the old policy.
    2. Mask their contribution to the update.
    3. Keep learning signals balanced.
  • Why it matters: Without masking, negative, off-policy samples trigger instability and collapse. šŸž Anchor: Tossing out spoiled ingredients keeps the stew from turning bad.
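
The masking rule can be sketched as a single predicate; the ratio threshold standing in for "very off-policy" is illustrative:

```python
def sequence_mask(ratio, advantage, ratio_floor=0.5):
    """Return 0.0 (mask out) when a sequence both underperformed
    (negative advantage) and is far off-policy (very low importance
    ratio); otherwise 1.0 (keep its contribution)."""
    harmful = advantage < 0.0 and ratio < ratio_floor
    return 0.0 if harmful else 1.0

# Multiplying each sequence's loss term by this mask mutes exactly the
# samples the text identifies as the drivers of collapse.
```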

Step 3: SAMPO—put the winning pieces together

Recipe for SAMPO:

  1. Stable start and guardrails

    • Begin from behavior-cloned policy.
    • Enforce format penalty.
    • Apply KL regularization.
    Why: Reduces early noise and keeps changes safe.
    Example: In ALFWorld, valid <action> tags quickly rise, so later learning focuses on useful strategies, not syntax.
  2. Fine-grained advantages (episode + step level)

    • Compute an overall outcome score (episode-level) and combine with per-state step comparisons (step-level).
    • Weight them to get a balanced, informative advantage.
    Why: Helps the agent learn which particular decisions at which states pushed it toward success.
    Example: In Sokoban, comparing moves from identical board states clarifies which push reduced future mistakes.
  3. Sequence-level clipping

    • Measure how much the entire sequence’s probability changed.
    • Clip updates if the change is too big.
    Why: Aligns the safety brake with how rewards are given (per sequence), stopping token-level outliers from destabilizing training.
    Example: On WebShop, no single overconfident word can cause a huge jump; updates stay smooth.
  4. Dynamic filtering

    • Remove groups where all sampled answers are identical in reward (all-right or all-wrong).
    • Resample to increase useful contrast.
    Why: Focus compute on informative differences.
    Example: For math problems, avoid batches that teach nothing new; include attempts that reveal which steps improved accuracy.
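
Steps 2 through 4 of the recipe can be combined into one toy batch objective. This is a minimal sketch on synthetic data, not the paper's implementation; the filtering rule, group-relative advantages, and sequence-level clipping follow the descriptions above with illustrative constants:

```python
import math
import statistics

def group_adv(rewards):
    """Group-relative advantage: score each rollout against its group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]

def seq_term(logp_new, logp_old, adv, eps=0.2):
    """Sequence-level clipped objective term for one rollout."""
    r = math.exp(sum(logp_new) - sum(logp_old))
    clipped = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * adv, clipped * adv)

def sampo_batch_objective(groups):
    """groups: list of rollout groups; each rollout is a tuple
    (logp_new, logp_old, reward). Uninformative groups are filtered,
    advantages are group-relative, and clipping is sequence-level."""
    terms = []
    for group in groups:
        rewards = [r for _, _, r in group]
        if max(rewards) == min(rewards):      # dynamic filtering
            continue
        for (lp_new, lp_old, _), adv in zip(group, group_adv(rewards)):
            terms.append(seq_term(lp_new, lp_old, adv))
    return sum(terms) / len(terms) if terms else 0.0
```

An all-identical group contributes nothing, while a mixed group yields balanced, clipped positive and negative terms; the behavior-cloned start, format penalty, and KL term from step 1 would sit outside this objective.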

Secret sauce:

  • It’s not any single part; it’s the combination. A clean start + sequence-level safety + precise credit + smart data curation. Together, they prevent collapse and maintain steady, monotonic improvement across very different tasks.

Example walkthrough (ALFWorld):

  • Input: A task like ā€œPut the cooled apple slice on the plate.ā€
  • Steps:
    1. Agent with good initial habits produces valid <action> sequences.
    2. Advantages compare entire outcomes and per-room-step choices.
    3. Sequence-level clipping prevents a risky jump after a surprising success/failure.
    4. Dynamic filtering ensures the batch contains helpful contrasts.
  • Output: A policy that steadily improves success rate, reaching over 90% without mid-training crashes.

04Experiments & Results

The test: The authors measure stability and success across four diverse agent tasks:

  • ALFWorld (text-based household tasks): multi-step planning and acting.
  • WebShop (e-commerce): browsing, filtering, choosing products.
  • Sokoban (puzzle planning): pushing boxes onto targets with minimal mistakes.
  • TIR Math (tool-using math): multi-step reasoning with Python as a helper tool.

Key metrics: Success rate and task score, plus training dynamics like KL drift, gradient norms, and valid-format ratio.

The competition: SAMPO vs. popular policy optimizers and variants:

  • GRPO (baseline), GSPO (sequence-level clipping), CISPO and SAPO (tolerant, token-level variants), GIGPO (fine-grained advantages), EMPG (uncertainty-shaped advantages), and DAPO (dynamic filtering framework). They also test loss aggregation choices.

The scoreboard (with context):

  • SAMPO consistently tops performance with smooth, monotonic learning curves.
  • On ALFWorld, SAMPO reaches 92.72% success—like getting an A+ while many others struggle with Cs or Bs due to midterm meltdowns. Averaged over agentic tasks, SAMPO improves about 25% over the GRPO baseline.
  • GSPO (sequence-level clipping) alone already beats most methods, showing the central role of sequence-aware safety.
  • CISPO/SAPO (tolerant clipping) show fast early gains but then collapse—like sprinting the first lap and then tripping hard. Gradient norms spike, KL divergence explodes, valid formats plummet, and success crashes.
  • GIGPO (fine-grained advantages) boosts performance, especially in environments like ALFWorld where state-structured comparisons matter.
  • EMPG helps in some tasks (e.g., WebShop) but not all, suggesting uncertainty signals are useful but task-dependent.
  • Dynamic filtering works best when combined with richer advantages (DAPO+GIGPO). With simpler advantages (DAPO+GRPO), it can accidentally remove useful format-learning signals and become unstable.
  • Loss aggregation choice (seq-mean-token-mean vs. token-mean) has mixed effects: it helps in some agent tasks but can hurt in math tasks with highly variable sequence lengths.

Surprising findings:

  1. Tolerant clipping is not just mildly risky—it’s a hidden landmine. It delivers attractive early results, then triggers collapse dominated by negative-advantage, low-ratio samples.
  2. A simple ā€œsequence maskingā€ trick—muting those harmful sequences—rescues tolerant methods dramatically, restoring stability and pushing success rates up.
  3. Dynamic filtering is not universally good; it depends on the quality and diversity of advantage signals to avoid erasing essential early learning signals like format correction.
  4. Off-policy staleness (using slightly outdated data within a rollout) measurably hurts performance. Lower staleness improves both ALFWorld and math benchmarks—like fresher ingredients make better food.

Meaning of the numbers:

  • SAMPO’s 92.72% success on ALFWorld means most household-style tasks are completed reliably, not just occasionally.
  • On WebShop and Sokoban, SAMPO improves success while keeping training curves smooth and safe—no scary spikes or crashes.
  • Compared to GRPO, SAMPO’s ~25% average boost is like moving from a solid B to an A across a semester.

Training dynamics visuals reinforce the story:

  • With tolerant clipping, KL and gradients surge before collapse, and valid-format ratios nosedive.
  • With sequence-level clipping (GSPO, and inside SAMPO), KL drift and gradients stay controlled, and valid formatting remains high.
  • After masking bad sequences, tolerant methods stabilize and look much more like their steady, sequence-level cousins.

Takeaway: Stability is not an accident; it’s engineered. The clean testbed plus the right choices on clipping, advantages, and filtering collectively deliver robust, reproducible gains over multiple agentic tasks.

05Discussion & Limitations

Limitations:

  • Environment coverage: While the tasks (ALFWorld, WebShop, Sokoban, TIR Math) are diverse, real-world settings can be even messier (noisy tools, partial observability, delayed rewards). Some findings may evolve with broader benchmarks.
  • Focus on policy gradient variants: The study centers on PPO-style optimizers and their design dials. Actor-critic with learned value baselines or model-based rollouts might introduce new stability considerations not fully explored here.
  • Engineering dependencies: The clean testbed (behavior cloning, format penalty, KL regularization) is essential; without it, the same algorithms may not show the reported stability.
  • Hyperparameter sensitivity: IS clipping thresholds are notably sensitive. Though ARLArena includes grid search, new domains will still require careful tuning.

Required resources:

  • Data and compute for supervised warm-start (behavior cloning) and for RL rollouts with environment interaction.
  • Infrastructure to enforce format checks, maintain reference-model KL, and perform dynamic filtering.
  • GPUs with sufficient memory for multi-turn contexts and larger models (e.g., 4B–8B) when scaling.

When NOT to use:

  • Extremely short single-turn tasks where RL adds little over standard SFT/RLHF.
  • Domains where formatting or action schemas cannot be enforced; without structure, instability risks rise.
  • Ultra-latency-sensitive applications where dynamic filtering or multi-turn rollouts are too costly.

Open questions:

  • Value learning and critics: Can learned baselines reduce variance further while preserving the simplicity and stability here?
  • Off-policy correction at scale: Beyond sequence-level clipping, what batching or replay strategies best limit staleness without excessive compute?
  • Curriculum design: How should we grow task difficulty and horizon length to maximize stable scaling laws for agents?
  • Multi-agent extensions: Does the same unified recipe hold when agents coordinate or debate, and how should advantages and clipping change for teams?
  • Safety alignment: Can format penalties and KL regularization be extended to richer safety checks (e.g., tool constraints, risk-aware plans) without slowing learning?

06Conclusion & Future Work

Three-sentence summary: This paper builds ARLArena, a clean training arena for agentic RL, and dissects stability into four dials: loss aggregation, IS clipping, dynamic filtering, and advantage design. It finds that sequence-level clipping is critical, richer advantages help, and dynamic filtering works best with informative signals; then it combines these into SAMPO, a unified, stable policy optimizer. SAMPO achieves strong, monotonic gains across tasks like ALFWorld, WebShop, and Sokoban, avoiding the crashes common in tolerant, token-level methods.

Main achievement: Turning instability analysis into a practical, unified recipe—SAMPO—that consistently prevents collapse and scales stable learning across diverse multi-step agent tasks.

Future directions:

  • Explore value-based critics and better off-policy corrections to reduce variance further.
  • Develop curricula for longer horizons and more complex, tool-rich environments.
  • Extend the framework to multi-agent coordination and richer safety constraints.

Why remember this: It shows that stability in agent training isn’t magic; it’s a well-balanced recipe. By standardizing the setup and combining sequence-level safety, precise credit, and smart data curation, we can train reliable, scalable agents that plan, act, and adapt over many steps—exactly what real-world AI needs.

Practical Applications

  • Build reliable web-shopping assistants that search, filter, and purchase items matching tight constraints.
  • Train home robots to execute multi-step chores (e.g., find, clean, place objects) with fewer missteps.
  • Improve customer-support agents that navigate tools (ticketing, CRM, knowledge bases) across multiple actions.
  • Enhance research copilots that plan searches, call APIs, run code, and synthesize results over many turns.
  • Develop classroom tutoring agents that guide students through multi-step problem solving with code tools.
  • Create workflow automations that coordinate calendars, emails, and documents with robust format compliance.
  • Upgrade game-playing agents that plan long sequences (like Sokoban or strategy games) without training collapse.
  • Enable safer API orchestration agents that follow strict output schemas, reducing invalid or risky calls.
  • Scale training for enterprise LLM agents using the ARLArena recipe to benchmark methods fairly.
  • Prototype multi-agent systems on top of stable single-agent SAMPO policies to coordinate complex tasks.
Tags: Agentic Reinforcement Learning, Policy Gradient, Sequence-level Clipping, Importance Sampling Clipping, Dynamic Filtering, Advantage Design, Behavior Cloning, Format Penalty, KL Regularization, Off-policy Staleness, ARLArena, SAMPO, Training Stability, LLM Agents, Long-horizon Decision Making