SkillOrchestra: Learning to Route Agents via Skill Transfer
Key Summary
- SkillOrchestra is a new way to make teams of AI models and tools work together by thinking in terms of skills, not just picking one big model for everything.
- Instead of training an expensive all-in-one router with reinforcement learning, it builds a reusable Skill Handbook from past experiences.
- The Skill Handbook lists what skills are needed in each mode (like search or code), and how good and costly each agent is at those skills.
- At run time, the orchestrator first decides the mode, then picks the best agent by balancing expected skill success against cost.
- This skill-aware routing avoids 'routing collapse,' where an RL router overuses one strong but pricey model.
- Across 10 benchmarks, SkillOrchestra beats state-of-the-art RL routers by up to 22.5 percentage points while cutting learning cost by up to 700×.
- It stays on the Pareto frontier, meaning it gets higher accuracy for the same or less money than other methods.
- The Skill Handbook transfers across different orchestrator backbones without retraining, making it future-proof as model pools change.
- Using the right level of skill detail matters: too fine-grained can confuse smaller orchestrators; the paper picks the best granularity with validation.
- Overall, explicit skill modeling makes orchestration more accurate, cheaper, more balanced, and easier to maintain.
Why This Research Matters
In real life, we often need the right expert at the right time—AI is no different. SkillOrchestra helps AI systems pick the best helper for each step, so they answer better and waste less money and time. This means chat assistants can research more accurately, coding copilots can fix bugs faster, and data tools can choose the best model without overspending. Companies benefit because they get higher quality at lower cost and can adapt quickly as new models arrive. Education and science benefit because complex, multi-step questions become easier to solve reliably. Overall, this approach makes AI teamwork more dependable, explainable, and affordable.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re a coach picking which kid on your team should take each play. Some kids sprint fast, some pass well, and some are great at defense. If you always give the ball to the fastest kid, you’ll waste chances when a careful pass would score. Great teams win by matching skills to each moment of the game.
🥬 Filling (The Actual Concept – The World Before): In early AI systems, people often used just one big model to answer a question. Later, AI systems became like teams of helpers: some models are better at math, some at coding, some at web search. Orchestrating them—deciding which helper to use at each step—became key. A common approach, model routing, picks a single model for an entire query. But modern tasks are multi-step and change as you go: first you may need search, then plan, then code, then summarize. A single model choice at the start is too coarse.
🥬 Filling (What Was Tried And Why It Struggled): To be more flexible, researchers used reinforcement learning (RL) to train routers that make a sequence of choices. This is like training the coach to learn play-by-play decisions using rewards. It works, but it’s costly to train, hard to adapt when you add or swap models/tools, and it can suffer from routing collapse—where the router keeps picking the same big model over and over, even when smaller, cheaper models could handle parts of the job just as well.
🍞 Bottom Bread (Anchor): Think of asking an AI to do deep research: it should search the web, read results, maybe write a small script to check facts, then craft a final answer. Old routers would choose one model and hope it could do everything. RL routers might learn the steps but sometimes just keep calling the biggest, priciest model for every step, running up the bill and not always improving results.
🍞 Top Bread (Hook): You know how a Swiss Army knife has different tools for different needs? You pull out the scissors to cut, the screwdriver to twist, and the knife to slice. Using the wrong tool makes simple tasks hard.
🥬 Filling (The Problem): In AI orchestration, decisions were either too coarse (one-shot model choice) or too costly and unstable (RL that collapses to one strong model). The missing piece was a reusable way to describe skills—fine-grained abilities like “multi-hop retrieval,” “symbolic logic coding,” or “high-precision arithmetic”—and to map which agents do those skills well for a given cost.
🍞 Bottom Bread (Anchor): If the job needs careful arithmetic, don’t call the storytelling champ. If it needs browsing, don’t call the math whiz. You want a playbook telling you who’s good at what, and when to use them.
🍞 Top Bread (Hook): Imagine a cooking class notebook where you write which friend makes the fluffiest pancakes, who grills best, and who’s fast at chopping. Over time, you get a guide that helps you assign the right cook to each station without relearning from scratch.
🥬 Filling (The Gap Filled): SkillOrchestra creates a Skill Handbook from past runs. It captures (1) mode-level insights (when to search, code, or answer), (2) a library of reusable skills inside each mode, and (3) agent profiles—how likely each agent is to succeed at each skill, and how much it costs. At run time, the orchestrator detects which skills are needed now and picks the agent with the best accuracy-for-cost trade-off.
🍞 Bottom Bread (Anchor): For a math question that needs symbolic manipulation, the handbook steers you to an agent that’s strong at that skill but not overly expensive, instead of defaulting to the most massive model.
Now let’s gently introduce the key concepts using the Sandwich pattern, in a friendly order that builds understanding.
- Reinforcement Learning (prerequisite) 🍞 Hook: You know how pets learn tricks with treats? Do a trick, get a biscuit! 🥬 The Concept: Reinforcement learning teaches a decision-maker by giving it rewards for good choices.
- How it works: (1) Try an action. (2) See what happens. (3) Get a reward or not. (4) Adjust choices to get more reward next time.
- Why it matters: RL can learn complex multi-step strategies, like which model to call next. But it can be expensive and may settle into bad habits (like overusing one model). 🍞 Anchor: An RL router might learn to always call a big model because it often works—even if it’s wasteful.
- Performance-Cost Trade-off 🍞 Hook: When buying a toy, you balance fun (performance) and price (cost). 🥬 The Concept: In orchestration we balance answer quality with money/time spent (tokens, latency).
- How it works: (1) Estimate how well each option will do. (2) Estimate cost. (3) Pick the best overall value.
- Why it matters: Great answers that are too pricey or slow aren’t practical; cheap answers that are wrong aren’t helpful. 🍞 Anchor: Choose a model that’s accurate enough without blowing the budget, like picking a solid mid-range bike over a gold-plated one.
- Operational Modes 🍞 Hook: A gamer switches modes—build mode, battle mode, explore mode. 🥬 The Concept: Modes are high-level actions like search, code, or answer.
- How it works: (1) Look at the state. (2) Pick the mode needed now. (3) Limit tools and agents to those that fit the mode.
- Why it matters: Modes keep the system organized and focused at each step. 🍞 Anchor: If you need web facts, pick Search mode; if you need to compute, pick Code mode.
- Routing Collapse 🍞 Hook: Imagine a team that only passes to one star player, ignoring open teammates. 🥬 The Concept: Routing collapse is when the router keeps choosing the same model even when others would be better for cost or fit.
- How it works: A learning rule or habit leads to overusing a strong model for everything.
- Why it matters: It wastes money and loses the benefits of specialization. 🍞 Anchor: An RL router picking the biggest model 98% of the time is routing collapse.
- Skill Handbook 🍞 Hook: A coach’s playbook lists plays (skills) and which players run them best. 🥬 The Concept: The Skill Handbook is a reusable guide that maps modes to skills and skills to agent strengths and costs.
- How it works: (1) Discover skills from past runs. (2) Record agent success and cost per skill. (3) Use this to pick who acts next.
- Why it matters: It makes routing precise, balanced, and explainable, and you don’t have to retrain everything when the team changes. 🍞 Anchor: The handbook might say “Agent A: great at symbolic logic; medium cost,” which guides selection in coding tasks.
- Skill Discovery 🍞 Hook: Like noticing new moves in soccer—“oh, that was a clever through-pass!” 🥬 The Concept: Skill discovery finds reusable abilities by comparing successes and failures.
- How it works: (1) Contrast good vs. bad trajectories. (2) Ask: what capability made the difference? (3) Name and store the skill.
- Why it matters: You only learn useful, reusable building blocks that matter for routing. 🍞 Anchor: If a failed run lacked “multi-hop retrieval,” you add that as a skill.
- Agent Profiles 🍞 Hook: A player’s stat card lists strengths, weaknesses, and stamina. 🥬 The Concept: An agent profile stores how likely an agent is to succeed on each skill, plus costs and notes.
- How it works: (1) Count successes/failures per skill. (2) Estimate success probability. (3) Track costs and gotchas.
- Why it matters: Profiles let the orchestrator compare agents fairly and pick the best value for the task. 🍞 Anchor: “Agent B: 80% on data extraction, low cost; 30% on symbolic logic” steers choices.
- Skill-Aware Orchestration 🍞 Hook: A music conductor assigns parts based on who plays strings, brass, or drums best. 🥬 The Concept: Route by skills: choose the mode, identify needed skills, pick the agent with the best skill match for the price.
- How it works: (1) Detect mode. (2) Detect active skills. (3) Score agents: expected skill success minus cost. (4) Execute and repeat next turn.
- Why it matters: You get better answers for less cost and avoid collapse. 🍞 Anchor: For a multi-hop QA step, pick the strong retriever; for final writing, pick a good summarizer.
- Skill Conditioning (context) 🍞 Hook: Practicing free throws before a game improves performance on that specific skill. 🥬 The Concept: Preparing agents or the orchestrator to think in terms of skills and use them.
- How it works: Provide clear skill definitions, examples, and usage notes in the handbook.
- Why it matters: Clear skills make selection easier and more accurate. 🍞 Anchor: A crisp definition of “symbolic logic coding” reduces mistakes when choosing a coding agent.
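The agent-profile idea above (count successes and failures per skill, then convert to a smoothed probability) can be sketched in a few lines. This is an illustrative reading, not the paper's code: the `AgentProfile` class name and the uniform Beta(1, 1) prior are assumptions made for the sketch.

```python
class AgentProfile:
    """Illustrative per-agent record: smoothed success rate and cost per skill."""

    def __init__(self):
        self.successes = {}  # skill -> success count
        self.failures = {}   # skill -> failure count
        self.costs = {}      # skill -> list of observed costs

    def record(self, skill, succeeded, cost):
        # Tally one observed outcome and its cost for this skill.
        bucket = self.successes if succeeded else self.failures
        bucket[skill] = bucket.get(skill, 0) + 1
        self.costs.setdefault(skill, []).append(cost)

    def competence(self, skill, prior_a=1, prior_b=1):
        # Beta/Laplace smoothing: mean of Beta(s + a, f + b).
        # An unseen skill starts at the prior mean (0.5 with a uniform prior).
        s = self.successes.get(skill, 0)
        f = self.failures.get(skill, 0)
        return (s + prior_a) / (s + f + prior_a + prior_b)

    def avg_cost(self, skill, default=1.0):
        obs = self.costs.get(skill, [])
        return sum(obs) / len(obs) if obs else default
```

With 8 successes and 2 failures on a skill, the uniform prior gives (8 + 1) / (10 + 2) = 0.75; the smoothing keeps estimates sane when an agent has only been tried a few times.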
02 Core Idea
🍞 Top Bread (Hook): You know how LEGO sets have pieces that click together to build anything—from cars to castles? When you label the pieces and know which does what, building gets faster and more reliable.
🥬 Filling (The Aha! in One Sentence): SkillOrchestra turns orchestration into skill-aware decision-making by learning a reusable Skill Handbook that maps modes → skills → agent strengths and costs, so the orchestrator can pick the right helper at each step under an explicit performance–cost trade-off.
Multiple Analogies (3 ways):
- Sports Team: Modes are plays (defense, offense); skills are positions (point guard, center); agents are players with stats. The coach (orchestrator) uses the team roster (handbook) to pick who should act now.
- Kitchen Brigade: Modes are stations (prep, grill, plating); skills are tasks (chop fine, sear steak); agents are cooks with ratings and speed. The head chef uses a notebook to assign tasks for best taste per minute and cost.
- Orchestra: Modes are sections (strings, brass); skills are parts (melody, harmony); agents are musicians with strengths. The conductor assigns parts to get the richest music with efficient rehearsal time.
Before vs After:
- Before: One-shot model routing missed changing needs mid-task. RL routers learned sequences but were expensive and often collapsed to one model, wasting budget.
- After: A handbook-driven orchestrator understands which skills are needed now and selects the best agent for those skills at a fair price, staying flexible as tasks evolve and avoiding collapse.
Why It Works (intuition, no equations):
- Decomposition: Big tasks are made of smaller skill demands. If you match each demand to the right agent, you do better overall.
- Evidence-driven: Success/failure per skill accumulates into probabilities (from simple counts), so selection uses real track records.
- Cost-aware: Scoring blends “how likely to succeed” with “how much it costs,” so the router doesn’t just chase accuracy; it chases value.
- Transferable memory: The handbook is reusable across different orchestrators and model pools; you update the memory, not re-train everything.
- Granularity control: You can choose coarser or finer skills to match how smart your orchestrator is—fewer, broader skills for small routers; finer skills for stronger routers.
Building Blocks (broken into smaller pieces with Sandwich explanations):
- Skill Handbook 🍞 Hook: A board game rulebook that tells you the moves and the best strategies. 🥬 The Concept: A three-part guide: (i) mode insights (when to search/code/answer), (ii) a registry of skills per mode, (iii) agent profiles with success and cost.
- How it works: Learn skills from traces; attach agent stats; use the handbook at runtime to decide.
- Why it matters: It turns orchestration from guesswork into looked-up, evidence-based decisions. 🍞 Anchor: The handbook might say: in code mode, for symbolic logic, Agent X succeeds often and is moderately priced—pick it.
- Skill Discovery 🍞 Hook: Spotting the hidden trick that made the winning play. 🥬 The Concept: Find capability gaps by comparing a successful run with a failed one.
- How it works: Contrast two trajectories; ask what ability explains the difference; name and index that skill.
- Why it matters: You only store skills that affect outcomes, keeping the handbook compact and useful. 🍞 Anchor: Failure at multi-hop reasoning → add “bridge fact chaining” as a skill.
- Agent Profiles 🍞 Hook: Trading cards with a player’s batting average and stamina. 🥬 The Concept: For each skill and mode, record an agent’s success rate (as a probability) and cost.
- How it works: Count successes/failures; convert to a smoothed estimate; track latency and token usage.
- Why it matters: Profiles enable fair, apples-to-apples comparisons under cost. 🍞 Anchor: If Agent A is 70% on multi-hop but cheap, and Agent B is 75% but very pricey, the router may prefer A.
- Mode-Aware Selection 🍞 Hook: Use a recipe book section—desserts section for cakes, not for steaks. 🥬 The Concept: First pick the mode (search, code, answer) based on state and stored insights.
- How it works: Use metadata about when mode transitions pay off.
- Why it matters: Right mode shrinks the decision space and picks fitting tools. 🍞 Anchor: Need fresh facts → Search mode; have facts already → Answer mode.
- Pareto-Optimal Handbook Selection (granularity control) 🍞 Hook: Choosing a map scale—street-level for walking, city-level for driving. 🥬 The Concept: Validate different handbook granularities and keep the one that best balances accuracy and cost for your orchestrator.
- How it works: Try subsets on a validation set; pick those on the Pareto frontier.
- Why it matters: Too-fine skills can confuse smaller orchestrators; the right granularity boosts reliability. 🍞 Anchor: Merge two near-duplicate coding skills into one broader skill for a 3B orchestrator; keep them split for a 32B orchestrator.
- Skill-Grounded Routing Rule 🍞 Hook: A shopping list where you check price and quality before buying. 🥬 The Concept: Score each candidate agent as (expected skill success) minus (cost weight × cost).
- How it works: Aggregate skill probabilities weighted by current needs; subtract cost; pick the max.
- Why it matters: This keeps decisions practical and budget-aware. 🍞 Anchor: For the current step, Agent C’s 0.65 success at low cost beats Agent D’s 0.7 at triple the price.
Put together, these pieces let SkillOrchestra route like a calm, budget-savvy coach who knows the team’s exact strengths and when to use them.
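The skill-grounded routing rule described above (expected skill success minus a cost weight times cost) can be written as a small scoring function. This is a minimal sketch under assumptions: the function names, the weight normalization, and the default 0.5 probability for unprofiled skills are choices made for illustration, not details from the paper.

```python
def score_agent(skill_probs, skill_weights, cost, lam=0.5):
    """Score = weighted expected skill success minus lam * cost.

    skill_probs:   {skill: estimated success probability} for one agent
    skill_weights: {skill: importance of that skill right now}
    lam:           cost weight trading accuracy against spend
    """
    total_w = sum(skill_weights.values()) or 1.0
    expected = sum(
        w * skill_probs.get(s, 0.5)  # unknown skills fall back to a neutral 0.5
        for s, w in skill_weights.items()
    ) / total_w
    return expected - lam * cost


def pick_agent(candidates, skill_weights, lam=0.5):
    """candidates: {name: (skill_probs, cost)} -> name of the best-scoring agent."""
    return max(
        candidates,
        key=lambda a: score_agent(candidates[a][0], skill_weights, candidates[a][1], lam),
    )
```

Plugging in the anchor example: an agent at 0.65 success and low cost scores 0.65 − 0.5·0.1 = 0.60, beating a 0.70 agent at triple the price (0.70 − 0.5·0.3 = 0.55), so the cheaper agent wins.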
03 Methodology
High-level flow: Input (query + conversation state) → Mode selection using handbook → Active skill detection → Agent scoring by skill competence minus cost → Execute → Update state → Repeat until done.
Step-by-step recipe (with Sandwich teaching moments):
- Build the global Skill Handbook from execution traces 🍞 Hook: Like building a class study guide by comparing what worked on homework vs what didn’t. 🥬 The Concept: From past runs, learn skills, estimate agent competence per skill, and extract mode insights.
- What happens: (a) Collect trajectories where different agents were tried. (b) For each mode and query, contrast a success with a failure to isolate the missing capability. (c) Turn that into a named skill with a clear description and usage indicators (keywords, patterns, exemplars). (d) For every agent–skill pair, count successes and failures and convert to a stable success estimate. (e) Record costs (tokens, latency) and mode-level cues (when to switch modes).
- Why this step exists: Without a handbook, every decision is a fresh guess or requires expensive RL training. The handbook makes knowledge reusable.
- Example: In code mode, if successes used algebraic simplification and failures didn’t, add “symbolic manipulation” as a skill; Agent Q has 8 successes and 2 failures on it → roughly 80% with smoothing. 🍞 Anchor: Your handbook grows into a map: skills like multi-hop retrieval, symbolic logic, high-precision arithmetic, each with per-agent ratings and tool constraints.
- Refine the skills (split/merge) to avoid bloat or confusion 🍞 Hook: Cleaning your notes so they’re not too messy or too tiny to read. 🥬 The Concept: Periodically split skills that hide multiple abilities or merge near-duplicates that don’t help routing.
- What happens: (a) Detect high-variance skills (agent performance varies wildly) → candidate split. (b) Detect statistically indistinguishable skills → candidate merge. (c) Use an LLM to rewrite clearer definitions. (d) Update counts accordingly.
- Why this step exists: Over-fragmentation confuses smaller orchestrators; redundancy wastes memory and attention.
- Example: If “data_processing” lumps numeric estimation and symbolic logic, split into “numerical_approximation” and “symbolic_logic.” 🍞 Anchor: After merging, a 3B orchestrator stops misidentifying subskills and routes more reliably.
- Select an orchestrator-specific subset of the handbook (granularity control) 🍞 Hook: Picking the right-sized backpack for a trip—you bring what you can carry well. 🥬 The Concept: Use a validation set to choose which modes, which skills, and which agent profiles (and at what granularity) best fit your orchestrator and budget.
- What happens: (a) Try candidate subsets of the handbook. (b) For each, run end-to-end validation to measure reward minus cost. (c) Keep those on the Pareto frontier. (d) Optionally allow query-time augmentation by adding nearest-neighbor skills relevant to the query.
- Why this step exists: A tiny orchestrator can’t reliably choose among 100 micro-skills; a larger one can.
- Example: For a 3B router, keep “data_processing” (coarse). For a 32B, keep both “symbolic_logic” and “numerical_approximation.” 🍞 Anchor: The chosen subset gives the best accuracy for its cost among all tested options.
- Inference-time routing loop 🍞 Hook: Following a recipe: pick the course (mode), choose the chef (agent), cook, then taste and adjust. 🥬 The Concept: At each turn, use the handbook to pick mode → identify active skills → score agents by competence minus cost → execute → move on.
- What happens: (a) Mode selection: π_mode(ψ | state; mode-metadata). (b) Active skills: retrieve skills tied to ψ and relevant to the state (via indicators/embeddings). (c) Agent scoring: sum weighted success estimates over active skills and subtract λ × estimated cost. (d) Execute with the chosen agent; collect outputs; update state; repeat.
- Why this step exists: It converts the handbook into concrete, budget-aware decisions step by step.
- Example with data: Suppose active skills are {multi-hop retrieval: 70%, symbolic logic: 40%} for Agent A, and {multi-hop: 60%, symbolic logic: 60%} for Agent B. If B costs much more, A may win; otherwise B wins. 🍞 Anchor: For a QA question, the router first calls a retriever-strong agent in Search mode, then a summarizer-strong agent in Answer mode.
- Cost modeling and competence estimation 🍞 Hook: Deciding between a bus pass and a taxi—you consider both price and arrival time. 🥬 The Concept: Keep per-mode costs (latency, tokens) and success estimates (from success/failure counts) in profiles.
- What happens: Update Beta-like tallies (successes = α, failures = β) and use their means as competence; track per-mode average costs.
- Why this step exists: Reliable estimates keep choices stable and transparent without expensive end-to-end RL.
- Example: After 20 trials of a skill with 12 successes, the competence sits around 60% with smoothing, guiding choices. 🍞 Anchor: An agent with moderate accuracy but very low cost can be favored for easy steps.
- The secret sauce (what makes it clever)
- Explicit skills: The middle layer (skills) decouples tasks from agents, enabling specialization without collapse.
- Reusability: The handbook transfers across orchestrators and updated model pools—update the handbook, not the router weights.
- Granularity tuning: Pareto validation picks the right detail level for your orchestrator, avoiding overfitting to tiny distinctions.
- Sample efficiency: Counting successes/failures per skill needs far less data than training an RL policy.
- Interpretability: You can explain routing choices: “We used Agent X for symbolic logic because it has the best success-to-cost ratio for that skill.”
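The Pareto-based handbook selection in step three boils down to a standard dominance filter: keep a candidate configuration only if no other candidate is at least as accurate and at least as cheap, and strictly better on one axis. A minimal sketch (the candidate names and numbers are hypothetical):

```python
def pareto_frontier(candidates):
    """candidates: list of (name, accuracy, cost) measured on a validation set.

    Returns the names of configurations on the accuracy-cost Pareto frontier,
    i.e., those not dominated by any other candidate.
    """
    frontier = []
    for name, acc, cost in candidates:
        dominated = any(
            # Other is at least as good on both axes and strictly better on one.
            (a2 >= acc and c2 <= cost) and (a2 > acc or c2 < cost)
            for n2, a2, c2 in candidates
            if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

For example, a "bloated" handbook at 0.76 accuracy and cost 20 is dominated by a "fine" one at 0.78 accuracy and cost 14 and gets dropped, while a cheaper "coarse" handbook survives as the budget option.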
End-to-end example (tiny walk-through):
- Query: “Prove-and-calc style math question.” State suggests Code mode. Active skills: symbolic_logic (high), high_precision_arithmetic (medium).
- Profiles: Agent C: 0.7 on logic, 0.6 on arithmetic, medium cost; Agent D: 0.75, 0.7, high cost.
- Score: If budget tight, pick C. If this is the final hard step, pay extra for D. Execute, get code result, then switch to Answer mode with a summarizer-strong agent to craft the final reply.
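The walk-through above can be condensed into one routing step over a tiny handbook fragment. Everything here is illustrative: the `HANDBOOK` numbers mirror the example profiles (Agent C at 0.7/0.6 with medium cost, Agent D at 0.75/0.7 with high cost), and the "summarizer" agent and cost figures are assumptions added for the sketch.

```python
# Hypothetical handbook fragment: mode -> agent -> (skill competences, cost).
HANDBOOK = {
    "code": {
        "C": ({"symbolic_logic": 0.70, "high_precision_arithmetic": 0.60}, 0.10),
        "D": ({"symbolic_logic": 0.75, "high_precision_arithmetic": 0.70}, 0.30),
    },
    "answer": {
        "summarizer": ({"summarization": 0.80}, 0.05),
        "C": ({"summarization": 0.55}, 0.10),
    },
}


def route_step(mode, skill_weights, lam):
    """Pick the agent in `mode` maximizing expected skill success - lam * cost."""
    def score(entry):
        probs, cost = entry
        total = sum(skill_weights.values())
        expected = sum(w * probs.get(s, 0.5) for s, w in skill_weights.items()) / total
        return expected - lam * cost

    agents = HANDBOOK[mode]
    return max(agents, key=lambda name: score(agents[name]))
```

With a tight budget (large `lam`), the cheaper Agent C takes the code step; when the step is decisive and cost matters less (small `lam`), paying extra for Agent D becomes worthwhile, and the answer step routes to the summarizer-strong agent.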
04 Experiments & Results
The Test (what they measured and why):
- They tested on 10 diverse benchmarks, including general QA (Natural Questions, TriviaQA, PopQA), multi-hop QA (HotpotQA, 2Wiki, MuSiQue, Bamboogle), and math reasoning (MATH, AMC). These tasks need different skills at different steps—perfect for checking if skill-aware routing helps.
- Metrics: Accuracy (e.g., Exact Match) and total cost (money/latency). The key is the performance–cost trade-off, not just raw accuracy.
The Competition (baselines):
- No routing: SFT, simple RAG, and CoT—single model tries to do all.
- Heuristic/Discriminative routers: choose a model once per query using difficulty or classifiers (KNN Router, MLP, BERT, GraphRouter, FrugalGPT, etc.).
- RL-based orchestration: Router-R1 (multi-turn, PPO-trained), ToolOrchestra (agents + tools with RL).
The Scoreboard (with context):
- Model routing setting: SkillOrchestra beats all baselines on general and multi-hop QA. Compared to Router-R1 (about 41.6 EM), SkillOrchestra reaches ~47.4 EM (+5.8), and SkillOrchestra+ reaches ~51.6 (+10.0). That’s like going from a B– to an A–, and then to a solid A, while also spending less.
- On tough multi-hop tasks (like MuSiQue, Bamboogle), the gains are especially big (e.g., Bamboogle ~51.2 → 58.4 → 63.2). For math (MATH/AMC), SkillOrchestra improves accuracy by up to 22.5 percentage points over Router-R1 while roughly halving inference cost.
- Pareto frontier: SkillOrchestra delivers higher accuracy at equal or lower cost than all heuristic and RL rivals. Example: Router-R1 hits ~41.6 EM at ~51.8¢; SkillOrchestra hits ~47.4 EM at ~38.4¢; SkillOrchestra+ ~51.6 EM at ~41.6¢. That’s like getting better grades and paying fewer tokens.
Surprising and Notable Findings:
- No routing collapse: Router-R1 overwhelmingly picked a single large model (~98% of calls), driving up costs and missing specialization. SkillOrchestra spread calls across models (e.g., Mixtral-8×22B ~44.5%, Qwen2.5-7B ~26%, LLaMA-3.1-70B ~15.4%, Qwen2.5-3B ~11.5%), matching strengths to steps.
- Transfer without retraining: A Skill Handbook learned with a small orchestrator (e.g., Qwen2.5-3B) also boosted larger orchestrators substantially, showing the handbook holds model-agnostic knowledge. Gains of +14.8 to +24.3 points were observed when transferring to stronger models.
- Stronger isn’t always pricier: Some cheaper models produced shorter, more efficient reasoning traces, so total cost went down even if per-token prices were similar.
Full agent orchestration (modes + tools):
- On FRAMES with modes like search, code, and answer, SkillOrchestra beat the RL-based ToolOrchestra in both accuracy and cost. Example: ~84% vs ~76% accuracy, with ~21.6% cost reduction. It also outperformed proprietary orchestrators under the same tool/model pools.
- Ablations: Removing the Skill Handbook dropped accuracy (around 85% → 71%) and spiked cost (9.3 → 122.9). Using discovered skills without refinement/selection still helped, but the best trade-off came from discovering, refining (split/merge), and selecting the right granularity via validation. This shows “more skills” isn’t always better—use the right-sized toolkit for your orchestrator.
Bottom line: Across tasks and settings, skill-aware routing was both smarter (higher accuracy) and thriftier (lower cost), stayed stable (no collapse), and carried over to new orchestrators without extra training—strong evidence the Skill Handbook captures the right reusable structure.
05 Discussion & Limitations
Limitations (be specific):
- Skill granularity sensitivity: If the handbook is too fine-grained for a small orchestrator, it may misidentify subskills and route poorly. The paper addresses this with Pareto-based selection, but careful tuning is still needed.
- Coverage gaps: The handbook is only as good as the experience it learned from. If the system encounters brand-new skills or domains, competence estimates may be uncertain until traces accumulate.
- Indicator reliability: Detecting which skills are active in the current state relies on linguistic/structural indicators and embeddings. Ambiguous states can lead to noisy skill activation.
- Cost modeling drift: Prices, latencies, and model behaviors change. Profiles require periodic refresh to stay accurate.
- Tool constraints and failures: Even with the right skill-agent match, external tools or sandboxes can fail. The handbook notes such risks but can’t eliminate them.
Required Resources:
- Execution traces with both successes and failures to discover and calibrate skills.
- An LLM (or similar) to help abstract capability gaps into skill definitions and to reflect on split/merge operations.
- Basic compute/storage to maintain the handbook graph and run validation for Pareto selection.
When NOT to Use:
- Single-shot, simple tasks where one small, fixed model already performs well and cost is negligible—overhead may not be worth it.
- Ultra-low-data settings where you cannot collect even a small batch of trajectories to seed useful skills.
- Highly dynamic domains where tasks change faster than the handbook can be updated—consider hybrid approaches or faster refresh cycles.
Open Questions:
- Automated granularity learning: Can we learn the optimal skill detail level end-to-end without manual selection loops?
- Continual and federated updates: How to merge handbooks learned across teams or sites, resolving conflicts safely?
- Skill compositionality: How can the handbook capture not just single skills but also common combinations (skill chains) and their joint costs?
- Robustness to adversarial prompts: How to keep skill activation stable under noisy or adversarial phrasing?
- Human-in-the-loop: What’s the best interface for experts to approve, edit, or annotate skills and profiles to accelerate convergence?
06 Conclusion & Future Work
3-Sentence Summary: SkillOrchestra reframes orchestration as skill-aware decision-making, learning a reusable Skill Handbook that maps modes to skills and skills to agent strengths and costs. At run time, the orchestrator selects a mode, identifies active skills, and picks the agent that best balances expected success against cost, avoiding routing collapse. Across diverse tasks and full agent settings, it consistently outperforms heuristic and RL-based baselines while cutting costs and transferring across orchestrators.
Main Achievement: The #1 contribution is introducing explicit, reusable skill modeling as the middle layer for orchestration—turning scattered execution experience into a structured, transferable handbook that drives accurate, cost-aware decisions without expensive RL retraining.
Future Directions:
- Learn optimal skill granularity automatically and adapt it online.
- Extend to richer tool ecosystems (databases, simulators) and track tool-specific reliabilities.
- Model skill chains and dependencies to plan multi-step routes even more efficiently.
- Explore human-in-the-loop editing to speed up skill discovery and refinement.
Why Remember This: SkillOrchestra shows that naming and measuring the right building blocks—skills—lets AI teams act like well-coached squads: specialized, balanced, thrifty, and adaptable. As models and tools evolve, the Skill Handbook can evolve with them, keeping orchestration on the performance–cost Pareto frontier. It’s a practical recipe for scaling compound AI systems without breaking the bank.
Practical Applications
- Customer support assistants that choose cheaper models for routine FAQs and call stronger models only for tricky multi-hop cases.
- Research copilots that switch between web search, code execution, and summarization agents for deep investigations.
- Coding copilots that detect when to invoke a symbolic reasoning coder vs. a general-purpose model to reduce debug time.
- Business analytics pipelines that route data-cleaning vs. forecasting steps to agents with the best skill-to-cost ratios.
- Educational tutors that pick specialized math or reading comprehension agents based on the detected student skill gap.
- Healthcare knowledge assistants that balance retrieval and reasoning for clinical queries while controlling operational cost.
- Legal and policy analyzers that chain citation retrieval, clause comparison, and summarization with mode-aware routing.
- E-commerce chatbots that classify intent, retrieve catalog info, and generate offers using the best agent per skill.
- Scientific discovery tools that coordinate literature search, hypothesis generation, and code-based simulations.
- MLOps platforms that reuse the Skill Handbook across new model pools to maintain efficiency without retraining routers.