Decoding ML Decisions: An Agentic Reasoning Framework for Large-Scale Ranking Systems
Key Summary
- GEARS is a new way to improve big ranking systems (like what shows up first in your feed) by letting an AI agent explore options safely, instead of humans tweaking knobs by hand.
- It turns a fuzzy goal like "boost long-term engagement but don't hurt safety" into a clear plan the system can test and trust.
- GEARS uses "Specialized Agent Skills" (little expert playbooks) to translate high-level vibes into concrete, testable policies.
- It checks every idea with strict "governance hooks" that look for shaky features and short-term luck, so brittle policies don't ship.
- A smart filter called tolerance-based Pareto filtering keeps not just the "best" ideas but also near-best ones that may be more stable in real life.
- Across 20 real experiments and five task types, GEARS beat strong baselines on ranking quality and Top-1 accuracy (0.86), meaning it found better policies more often.
- Ablations show both deterministic pre-filtering and Skills matter; removing either clearly hurts results.
- Feature-stability checks over six months automatically block flashy-but-unstable features, helping policies last after launch.
- GEARS improved key metrics across many product surfaces, showing it's a general framework, not a one-off trick.
- Bottom line: GEARS closes the "deployment gap": it's not just smart, it's safe and shippable.
Why This Research Matters
GEARS helps real products make better choices for users by turning fuzzy goals into safe, testable, and reliable policies. Instead of launching a policy that looks great for a week and then fizzles, GEARS favors options that keep working over months. It also respects business guardrails like safety and fairness, so improvements don't come with hidden costs. Engineers and data scientists save time because expert steps are captured as reusable Skills, and the system explains its reasoning. Even small percentage lifts compound at massive scale, meaning better experiences for millions of people. Ultimately, GEARS shows how AI can be both bold in exploration and careful in deployment.
Detailed Explanation
01 Background & Problem Definition
You know how your favorite apps decide what to show you first (videos, posts, or products)? For years, teams used models that predicted what users might click, watch, or buy. That worked well for single goals and simple setups. But as apps grew, so did the maze: there are now many goals at once (long-term vs. short-term engagement, safety, quality, creator fairness), many moving parts (features come and go), and rules that can't be written as neat equations (business priorities and safety guardrails).
The world before: Ranking systems were mostly about finding the "best" model and turning its score into an order. Personalization methods like uplift modeling tried to match each user (or group) with the best version of the product. They were great at crunching numbers offline, but these systems often ignored the messy parts of real production: some features drift over time, some are being deprecated, and some "wins" come from random noise that vanishes after launch.
The problem: The true slowdown wasn't model math; it was the "engineering context constraint." Turning a fuzzy product idea, like "nudge healthier habits without hurting safety," into a precise, testable, and reliable policy is hard. Experts had to translate intent into hypotheses, pick segments, choose treatments, and watch out for hidden traps (feature instability, conflicts with other surfaces, or values that don't last). This heavy manual work doesn't scale, so lots of good ideas stayed undiscovered.
Failed attempts: • Static uplift models pick policies that look best on paper but may rely on fragile signals. • General LLM agents ("LLM-as-Ranker") can reason in language, but without grounded context they hallucinate policies that sound plausible yet break in production. • Scalarizing many goals into one score locates only the convex part of the trade-off frontier, missing near-best points that are safer or more aligned with business vibes.
The gap: Teams needed a system that (1) accepts high-level intent in natural language, (2) explores a large space of cohort-and-treatment policies, (3) keeps not only the top-scoring policies but also near-best ones that may be sturdier, and (4) blocks anything that looks brittle when checked over time, cohorts, and infrastructure rules.
Real stakes: This matters to everyone. • Users: more relevant, safer, and fairer feeds. • Creators/Businesses: better discovery without hurting other goals. • Engineers/Scientists: fewer weeks spent wiring data and debugging fragile wins. • Companies: faster learning cycles with less risk.
Enter GEARS: Instead of treating optimization like a one-shot choice, GEARS turns it into an autonomous discovery loop. It captures expert know-how as reusable Skills, explores candidate policies with a large-scale HTE engine, and enforces production-grade checks with governance hooks. The result: policies that are not only strong on metrics, but also stable, interpretable, and actually deployable.
02 Core Idea
The "Aha!" in one sentence: Treat ranking optimization as an agent exploring a safe, programmable world, translating high-level vibes into concrete, stable, and shippable policies.
Three analogies:
- City planner: Imagine a planner who hears, "Make the city greener without slowing traffic." Instead of drawing one map, they try small road tweaks, more bike lanes, check weather patterns, and keep what still works after months. That's GEARS.
- Chef with food testers: A chef hears, "Healthier and still yummy." They test many recipes, keep near-best ones that stay popular all week, and toss any that spoil fast. That's tolerance-based filtering plus governance.
- Coach with playbooks: A coach hears, "Win more but avoid injuries." They use playbooks (Skills), test lineups (policies), and only keep plays that work across games (stability). That's agentic exploration with hooks.
Before vs. after:
- Before: People hand-wrote hypotheses, models picked a single "best" policy, production discovered brittleness later.
- After: An agent turns intent into a search spec, generates many candidate policies, keeps the best and near-best, and auto-filters those that won't last in deployment.
Why it works (intuition):
- Exploration over one-shot: Real systems are non-convex with trade-offs. Searching broadly finds more viable operating points.
- Keep near-best: The "almost top" policy can be more stable, safer, or better aligned with business constraints than the number one on paper.
- Encode expertise as Skills: Playbooks prevent hallucinations and keep reasoning on-rails with code, SQL, and instructions.
- Governance first: Deterministic hooks reject policies that "won" by luck, noisy features, or unstable cohorts, closing the deployment gap.
- Right-sized context: Progressive disclosure feeds the model only what it needs when it needs it, avoiding context rot.
Building Blocks (Sandwich explanations, in learning order):
- Pareto efficiency
- Top Bread (Hook): Imagine sharing a pizza so no one can get more without someone else getting less.
- Filling: • What it is: A state where improving one metric necessarily worsens another. • How it works: (1) List all metrics you care about. (2) Compare policies. (3) Mark those where no other policy is better on all metrics. (4) These form the Pareto frontier. • Why it matters: If you only chase one score, you might unknowingly hurt others; Pareto keeps trade-offs honest.
- Bottom Bread (Anchor): Boosting long-term engagement may slightly lower short-term clicks; frontier points show the best trade-offs.
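The frontier logic above can be sketched in a few lines. The metric pairs below (long-term engagement lift, short-term click lift) are invented for illustration, not values from the paper:

```python
def dominates(a, b):
    """True if policy `a` is at least as good as `b` on every metric
    and strictly better on at least one (higher is better here)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(policies):
    """Keep only policies that no other policy dominates."""
    return [p for p in policies
            if not any(dominates(q, p) for q in policies if q is not p)]

# (long_term_engagement_lift, short_term_click_lift), toy numbers
candidates = [(0.10, -0.02), (0.08, 0.01), (0.05, 0.03), (0.04, 0.00)]
# The first three trade off against each other; (0.04, 0.00) is
# dominated by (0.08, 0.01) and drops out.
print(pareto_frontier(candidates))
```

Each surviving point is a distinct trade-off: more long-term lift at the cost of short-term clicks, or vice versa.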
- Multi-metric constraints
- Top Bread: You know how you must keep good grades while also sleeping enough?
- Filling: • What it is: Rules that say "improve A but don't let B get worse." • How it works: (1) Pick a primary goal. (2) Set guardrails. (3) Keep only policies that meet guardrails while improving A. • Why it matters: Without constraints, you might "win" by breaking something important.
- Anchor: "Raise time-on-site, but do not reduce safety signals."
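A minimal sketch of the guardrail idea, assuming each candidate policy is scored as a delta per metric; the names, numbers, and tolerance default are illustrative:

```python
def meets_guardrails(policy, primary="metric_1", guardrails=("metric_2",), tol=0.0):
    """Keep a policy only if it lifts the primary metric and no
    guardrail metric regresses beyond the tolerance."""
    return policy[primary] > 0 and all(policy[g] >= -tol for g in guardrails)

policies = [
    {"name": "P1", "metric_1": 0.10, "metric_2": -0.05},  # wins A, breaks guardrail B
    {"name": "P2", "metric_1": 0.08, "metric_2": 0.00},   # smaller win, guardrail held
]
survivors = [p["name"] for p in policies if meets_guardrails(p)]
print(survivors)  # ['P2']
```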
- Heterogeneous Treatment Effect (HTE)
- Top Bread: Different plants love different amounts of sun.
- Filling: • What it is: People respond differently to the same change. • How it works: (1) Split users into segments. (2) Estimate how each treatment helps each segment. (3) Match segments to best treatments. • Why it matters: One-size-fits-all wastes wins; personalization captures hidden gains.
- Anchor: New users may prefer simpler feeds; power users may enjoy richer mixes.
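The segment-to-treatment matching step can be illustrated with a toy effect table; the segments, treatments, and estimated lifts are all hypothetical:

```python
# segment -> {treatment: estimated lift}, as an HTE engine might report
effects = {
    "new_users":   {"simple_feed": 0.12, "rich_feed": 0.03},
    "power_users": {"simple_feed": 0.01, "rich_feed": 0.09},
}

# Assign each segment the treatment with the largest estimated effect.
policy = {segment: max(lifts, key=lifts.get) for segment, lifts in effects.items()}
print(policy)  # {'new_users': 'simple_feed', 'power_users': 'rich_feed'}
```

A global choice would pick one feed style for everyone and forfeit one segment's gain; the per-segment assignment captures both.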
- Adaptive experimentation
- Top Bread: A chef tastes and tweaks a recipe until itâs just right.
- Filling: • What it is: Learn by trying, measuring, and improving. • How it works: (1) Try candidates. (2) Measure outcomes. (3) Keep good ones, change or drop weak ones. (4) Repeat. • Why it matters: Systems shift over time; adapt or lose gains.
- Anchor: Weekly policy updates that lean into what continues to win.
- Policy selection
- Top Bread: A coach picks a game plan based on players and opponents.
- Filling: • What it is: Choosing which treatments go to which cohorts under constraints. • How it works: (1) Generate candidates. (2) Score across metrics. (3) Apply guardrails. (4) Keep the best set. • Why it matters: The right policy set turns potential into real, measurable gains.
- Anchor: "Active users get Treatment A; casual users get B; safety stays neutral."
- Specialized Agent Skills
- Top Bread: Like mini-expert playbooks for tough tasks.
- Filling: • What it is: Modular tools encoding expert ranking know-how (instructions + code/SQL + metadata). • How it works: (1) Route to the right Skill. (2) Follow step-by-step analysis. (3) Execute scripts. (4) Return interpretable outputs. • Why it matters: Reduces hallucinations and standardizes complex reasoning.
- Anchor: A Skill that audits feature stability and flags deprecations before selection.
- Vibe Optimization
- Top Bread: A band tunes instruments to match the roomâs mood.
- Filling: • What it is: Steering with high-level intent that the agent translates into constraints and objectives. • How it works: (1) Take natural-language goals. (2) Build a search spec. (3) Explore aligned policies. (4) Explain trade-offs. • Why it matters: Stakeholders talk in goals, not raw weights; vibes bridge human language to machine action.
- Anchor: "Favor long-term loyalty, keep safety steady" becomes a concrete search with guardrails.
- Deterministic Lifecycle Governance
- Top Bread: A pilot's preflight checklist, every time.
- Filling: • What it is: Hard checks around every agent action to ensure stability and reproducibility over time. • How it works: (1) Register hooks. (2) Test statistical robustness, feature stability, cohort consistency. (3) Reject and explain failures. (4) Iterate. • Why it matters: Blocks short-term flukes and unstable features from reaching production.
- Anchor: A policy that "wins" due to a drifting feature is auto-rejected with reasons.
- GEARS (the framework)
- Top Bread: Think of a smart lab assistant that designs, runs, and vets experiments for you.
- Filling: • What it is: An agentic system that turns high-level intent into stable, deployable ranking policies. • How it works: (1) Intent → search spec. (2) Generate candidates via HTE. (3) Analyze with Skills + Knowledge Brain. (4) Enforce governance hooks. (5) Output shippable config. • Why it matters: It closes the "deployment gap" from good ideas to safe launches.
- Anchor: Product asks for "boost loyalty, hold safety"; GEARS returns a ready-to-ship cohort policy.
- Tolerance-based Pareto Filtering
- Top Bread: Choosing rides at a theme park: not only the single top thrill, but also fun near-top rides that everyone enjoys safely.
- Filling: • What it is: Keep Pareto-optimal and near-optimal candidates within uncertainty bands. • How it works: (1) Random-weight search to find candidates. (2) Define per-metric tolerance tied to uncertainty. (3) Keep policies not strictly dominated beyond tolerance. • Why it matters: Near-best often means more stable and practical in the real world.
- Anchor: A policy 0.02% below the top on paper but far more stable over six months stays in the final set.
03 Methodology
At a high level: Intent (natural language) → Search Specification → Candidate Generation (GAS + tolerance expansion) → Insight-Driven Selection (Skills + Knowledge Brain + progressive disclosure) → Governance Validation (deterministic hooks) → Production-ready policy.
Step A: Intent to Search Specification
- What happens: The operator states a goal, e.g., "Improve long-term engagement without hurting safety." GEARS converts this into objectives, guardrails, cohort constraints, and experiment parameters.
- Why it exists: Human goals are fuzzy; optimization needs structure. Without this, the agent might chase the wrong target or miss constraints.
- Example with data: "Metric 1 up; Metric 2 neutral" becomes: primary objective = Metric 1; guardrail: Metric 2 ≥ 0; allowed cohorts: activity level, tenure; experiment: use past 4 weeks of data with confidence intervals.
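One way the structured output of Step A could look; the schema below is an assumption made for illustration, not the paper's actual specification format:

```python
from dataclasses import dataclass, field

@dataclass
class SearchSpec:
    """Hypothetical structured form of a natural-language optimization goal."""
    objective: str                                    # metric to maximize
    guardrails: dict = field(default_factory=dict)    # metric -> minimum allowed delta
    cohort_dims: list = field(default_factory=list)   # allowed segmentation dimensions
    lookback_weeks: int = 4                           # historical data window
    confidence: float = 0.95                          # CI level for estimates

# "Metric 1 up; Metric 2 neutral" from the example above:
spec = SearchSpec(
    objective="metric_1",
    guardrails={"metric_2": 0.0},          # Metric 2 must stay >= 0
    cohort_dims=["activity_level", "tenure"],
)
print(spec.objective, spec.guardrails)
```

The point is that downstream stages consume typed fields, not free text, so the agent's search and the governance hooks share one unambiguous contract.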
Step B: Candidate Generation via GAS + Tolerance Expansion
- What happens: GEARS uses a large-scale HTE engine (GAS) to produce many cohort-treatment policies. It samples weight vectors over metrics, keeps the top-K per weight, then applies tolerance-based Pareto filtering: any policy not strictly dominated beyond per-metric uncertainty stays.
- Why it exists: Real systems have non-convex trade-offs; a single scalarized optimum misses valuable operating points. Near-frontier policies can be safer and more aligned with constraints.
- Example: If Policy P1 slightly beats P2 on paper but P2's confidence intervals overlap and P2 uses more stable features, both are kept for deeper analysis.
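The weight-sampling loop in Step B might look roughly like this; `n_weights`, `top_k`, and the policy tuples are illustrative choices, not values from the paper:

```python
import random

def scalarize(policy, weights):
    """Collapse a policy's metric vector into one score for a given weighting."""
    return sum(w * m for w, m in zip(weights, policy))

def generate_candidates(policies, n_weights=100, top_k=3, seed=0):
    """Sample random weight vectors over the metrics; pool the top-K
    policies under each weighting for later tolerance-based filtering."""
    rng = random.Random(seed)
    pool = set()
    for _ in range(n_weights):
        w = [rng.random() for _ in range(len(policies[0]))]
        ranked = sorted(policies, key=lambda p: scalarize(p, w), reverse=True)
        pool.update(ranked[:top_k])
    return pool

policies = [(0.10, -0.02), (0.08, 0.01), (0.05, 0.03), (0.01, 0.00)]
print(generate_candidates(policies))
```

Because different weightings crown different winners, the pooled set covers many operating points on the trade-off surface instead of one scalarized optimum.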
Step C: Insight-Driven Policy Selection (Skills + Knowledge Brain)
- What happens: Specialized Skills analyze candidates: feature stability audits, guardrail interpretation, trade-off diagnosis, and feature explanations. A Domain Knowledge Brain supplies curated past results, edge cases, and platform quirks. Progressive disclosure loads only the metadata first, instructions at activation, and scripts at execution, to avoid context rot.
- Why it exists: Complex, procedural reasoning is where LLMs can drift. Skills anchor the agent with code and checklists. The Knowledge Brain grounds decisions in real evidence. Progressive disclosure keeps reasoning clean and focused.
- Example: A Skill checks that "Feature 4" shows ~50% quantile shift over 6 months; it flags it as unstable and removes policies that depend on it.
Step D: Deterministic Lifecycle Governance (hooks)
- What happens: Every action/output is checked with deterministic validators: statistical robustness (confidence intervals, consistency across cohorts), feature stability (user-cohort shift ratio thresholds), and persistence over time windows (e.g., monthly backtests). Failures return structured error reports that prompt the agent to refine or discard the policy.
- Why it exists: Short-term wins often disappear after launch. Governance blocks noisy victories, deprecations, or cohort artifacts.
- Example with thresholds: A pre-search filter requires R_shift ≤ 15% (binary) or ≤ 45% (quantile) over 6 months. Feature 4 (~50% shift) fails and is excluded.
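The threshold check can be sketched directly from the numbers quoted above. The R_shift inputs are illustrative, and the exact shift-ratio computation is not specified in this summary:

```python
# Per-feature-type thresholds from the pre-search stability filter.
THRESHOLDS = {"binary": 0.15, "quantile": 0.45}

def passes_stability(feature_type, r_shift):
    """A feature passes only if its user-cohort shift ratio over the
    six-month window stays at or below the threshold for its type."""
    return r_shift <= THRESHOLDS[feature_type]

print(passes_stability("quantile", 0.06))  # stable baseline-like feature
print(passes_stability("quantile", 0.50))  # "Feature 4"-like drift, rejected
```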
Step E: Output a Production-Ready Configuration
- What happens: The remaining policy is translated into a deployable config: which cohorts get which treatments, with annotations on trade-offs, stability evidence, and guardrail checks. It can be auto-wired into the iterated ranking pipeline.
- Why it exists: Scientists and engineers need something they can ship, explain, and maintain.
- Example: "Active users → Treatment A (+0.10% Metric 1, neutral Metric 2); Casual users → Treatment B (+0.08% Metric 2, neutral Metric 1); all pass stability checks; backtest stable over one month."
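A plausible shape for the deployable configuration described in Step E; the keys and structure are hypothetical, not the production schema:

```python
# Hypothetical deployable output: cohort-to-treatment assignments plus
# the evidence annotations a reviewer or pipeline would check.
policy_config = {
    "cohorts": {
        "active": {"treatment": "A",
                   "expected": {"metric_1": +0.0010, "metric_2": 0.0}},
        "casual": {"treatment": "B",
                   "expected": {"metric_2": +0.0008, "metric_1": 0.0}},
    },
    "evidence": {
        "stability_checks": "passed",
        "backtest_window": "1 month",
        "guardrails_held": ["metric_2"],
    },
}
print(policy_config["cohorts"]["active"]["treatment"])  # A
```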
Concrete mini-walkthrough:
- Input: "Increase loyalty (Metric 1) without hurting safety (Metric 2)."
- Spec: Objective = Metric 1 up; Guardrail = Metric 2 ≥ 0; Cohorts by activity and tenure; 95% CIs.
- GAS: Sample weights; collect Top-K policies per weight; keep near-frontier via tolerance.
- Skills: Run feature stability audit; drop policies relying on Feature 4 (~50% shift); explain remaining features.
- Governance: Check cohort consistency and time-slice persistence; reject a policy that spikes for one week only.
- Output: A two-cohort policy that lifts Metric 1 overall and keeps Metric 2 neutral; documentation auto-generated.
The secret sauce:
- Combining broad search (GAS) with near-frontier admission (tolerance) prevents overfitting to a narrow optimum.
- Encoding hard-won expertise as Skills + Knowledge Brain keeps reasoning grounded and repeatable.
- Deterministic hooks enforce real-world durability, so what ships keeps working after the honeymoon period.
04 Experiments & Results
The test: The team used 20 internal experiments. For each, GAS generated hundreds of candidate policies with metric values and confidence intervals. They then created five instruction types per experiment (maximize both, maximize-with-constraint, trade-off analysis, efficiency, single-metric), totaling 100 tasks. The agent had to select and rank policies per instruction.
The competition: GEARS was compared to Naive prompting, Chain-of-Thought, Self-Consistency, Self-Refine, and Code-as-Action. All used the same data; baselines differed in prompting or execution style.
Metrics: Precision@K, Recall@K, NDCG@K, Top-1 Accuracy, Top-1-in-Ground-Truth, and Spearman rank correlation. K ∈ {1, 3, 5}.
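To make the scoring concrete, here are minimal versions of two of these metrics; the relevance labels and rankings are invented examples, not data from the evaluation:

```python
import math

def dcg(rels):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(predicted_rels, k):
    """DCG of the predicted order divided by DCG of the ideal order."""
    ideal = sorted(predicted_rels, reverse=True)
    return dcg(predicted_rels[:k]) / dcg(ideal[:k])

def top1_accuracy(predictions, ground_truth_best):
    """Fraction of tasks where the first-ranked policy is the true best."""
    hits = sum(p[0] == g for p, g in zip(predictions, ground_truth_best))
    return hits / len(predictions)

# Ground-truth relevance of policies in the order the agent ranked them:
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))
print(top1_accuracy([["P2", "P1"], ["P1", "P3"]], ["P2", "P3"]))  # 0.5
```

NDCG rewards putting the genuinely best policies near the top, which is why it complements Top-1 Accuracy: a system can miss rank one yet still order the rest well, or vice versa.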
Scoreboard (with context):
- GEARS achieved NDCG@5 ≈ 0.96 and Top-1 Accuracy ≈ 0.86. Think of that as regularly getting an A/A+ where others hovered around B/B+.
- Code-as-Action was the closest baseline (e.g., Top-1 Accuracy ≈ 0.68), but GEARS still led by a wide margin, showing that Skills + governance + tolerance filtering add meaningful power beyond just executing code.
- GEARS also had stronger global rank correlation, meaning it didn't just find good items; it ranked them in the right order more reliably.
Ablations (what mattered most):
- Removing deterministic pre-filtering ("w/o Bash") tanked Top-1 and ranking quality. This shows that cleaning the candidate set first stabilizes downstream reasoning.
- Removing Skills ("w/o Skill") still beat baselines but lost notable performance vs. full GEARS, proving Skills add consistent, interpretable gains.
Hooks and stability checks: A six-month feature stability benchmark introduced the user-cohort shift ratio (R_shift). Baselines: a stable feature set S shifted ≈6% (quantile) and ≈2% (binary). Feature 4 shifted ~50% (quantile) and ~20% (binary), clearly unstable; the hook thresholds (binary ≤ 15%, quantile ≤ 45%) blocked it. Result: high-lift but unstable candidates were filtered pre-search.
Pareto and persistence: In a controlled selection with four candidates, filtering out high-variance ones left a policy that stayed strong for a full month, evidence that the governance layer promotes policies that last, not just those that spike.
Real-world impact: Across diverse surfaces, GEARS improved key metrics (e.g., +0.04% to +0.37% depending on domain and metric). These sound small but are large at scale: tiny percentage lifts can represent huge aggregate benefits. Crucially, improvements generalized across many surfaces, showing GEARS is a reusable framework, not a single-surface trick.
Surprising findings:
- Near-frontier candidates, admitted by tolerance filtering, often yielded better deployment stability than the single top candidate.
- Feature stability hooks, set using realistic baselines (even "stable" features drift somewhat), prevented shipping wins built on sand.
05 Discussion & Limitations
Limitations:
- Scope dependence: GEARS needs a well-formed search specification. If goals are vague or conflict without clear guardrails, the system explores too widely or picks compromises that please no one.
- Data quality: Stability checks and HTE rely on good logging, trustworthy features, and sensible confidence intervals. Bad or shifting telemetry reduces reliability.
- Cold starts and small data: With tiny cohorts or short windows, uncertainty grows; tolerance bands widen and selection may be conservative.
- Computation and integration: Skills, Knowledge Brain, and governance pipelines require engineering effort, clean tooling, and disciplined documentation.
- Cross-surface interference: A policy great for one surface might cannibalize another; multi-surface governance needs continued expansion.
Required resources:
- Access to historical experiments, metrics, and confidence intervals.
- An LLM backbone with tool execution, a Skills library (code/SQL + metadata), and a curated Knowledge Brain.
- Governance infra for deterministic hooks, feature-stability tracking, and cohort consistency tests.
- Compute for GAS candidate generation and periodic backtests.
When not to use:
- Single-metric, low-stakes tweaks where a simple A/B and scalar ranking suffice.
- Extremely volatile contexts where features or objectives change weekly; governance will keep rejecting everything.
- Very small products without enough data; the overhead may outweigh benefits.
Open questions:
- Can we learn Skills automatically from successful analyst workflows and code histories?
- How to couple GEARS with online adaptive experimentation loops while preserving determinism and safety?
- Theoretically, how do tolerance parameters relate to long-term stability guarantees?
- How to extend governance to multi-surface optimization and detect cross-surface cannibalization early?
- Can we quantify "vibe alignment" more directly, bridging qualitative intent and measurable constraints?
06 Conclusion & Future Work
Three-sentence summary: GEARS turns ranking optimization from manual, one-shot tuning into an autonomous discovery loop that starts from high-level intent and ends with stable, shippable policies. It combines Specialized Agent Skills, a wide candidate search (with tolerance-based Pareto filtering), and deterministic governance hooks to avoid brittle, short-lived wins. Across many experiments and real product surfaces, it finds better trade-offs with stronger deployment reliability and less human toil.
Main achievement: Closing the deployment gap by building a system that doesn't just find strong policies but proves they will last, encoding expertise and enforcing stability checks throughout the lifecycle.
Future directions: Integrate tighter online adaptive experimentation; broaden governance to multi-surface interactions; auto-learn new Skills from analyst playbooks; strengthen the Knowledge Brain with richer causal signals; and refine tolerance schedules based on uncertainty calibration.
Why remember this: GEARS shows that the path to better AI decisions in complex products isn't only smarter models; it's smarter processes. By treating optimization as safe exploration, keeping near-best stable options, and baking in deterministic checks, GEARS points toward self-optimizing systems that are both flexible and operationally safe.
Practical Applications
- Translate a product owner's natural-language goal into a concrete, testable search specification.
- Generate a wide set of cohort-treatment policies using HTE and retain near-best candidates for stability.
- Automatically audit feature stability over six months and exclude unstable signals before selection.
- Run guardrail interpretation Skills to ensure safety, quality, or fairness metrics are preserved.
- Use progressive disclosure to keep the agent's reasoning focused and prevent context overload.
- Apply deterministic governance hooks to reject policies driven by noise or transient effects.
- Backtest selected policies for persistence and cohort consistency before rollout.
- Auto-generate a deployable ranking configuration with human-readable justifications.
- Continuously monitor post-launch drift and trigger refinement loops when thresholds are crossed.
- Scale the approach across multiple product surfaces while tracking cross-surface interactions.