
CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Intermediate
Jinpeng Chen, Cheng Gong, Hanbo Li et al. · 3/2/2026
arXiv

Key Summary

  • CoVe is a way to create training conversations for AI agents that use tools, while guaranteeing the conversations are both challenging and correct.
  • It starts by picking exact, solvable goals (constraints) from a sandbox database, then rewrites them in fuzzy, human-like language to make the task realistically ambiguous.
  • A user-simulator model chats with the agent, revealing needs step by step, while the agent asks clarifying questions and calls tools (APIs).
  • After the chat ends, the original exact goals act like a checklist to verify whether the agent’s tool calls truly did everything required, with no extra, pointless steps.
  • Only perfectly verified conversations become training data for supervised fine-tuning; the same checklist can also give exact rewards for reinforcement learning.
  • On the tough τ-bench Airline and Retail domains, a compact 4B-parameter CoVe model beats similarly sized baselines and comes close to models up to 17× larger.
  • Carefully verified, zero-noise data from CoVe can outperform far larger but noisier datasets, showing quality beats quantity.
  • Sequential SFT+RL underperformed SFT alone due to a weaker, single user simulator during RL, hinting that better simulators can unlock even stronger results.
  • The team open-sourced the 4B model, code, and 12K verified trajectories so others can build reliable interactive agents.

Why This Research Matters

CoVe shows how to turn messy, human-style requests into reliable, verified training for AI helpers that actually do things. This leads to agents that can change flights, process returns, or update accounts without making dangerous guesses. By guaranteeing correctness with a deterministic checklist, CoVe reduces costly errors and policy violations. It also proves that excellent training data can make small models punch far above their weight, which saves compute and money. As companies deploy agents to handle real customers, CoVe’s approach helps ensure those agents are careful, accurate, and consistent. The open-source release means the community can replicate and improve the method. Over time, this could make everyday AI assistants safer and more trustworthy.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re asking a store helper for a return, but you don’t remember your order number. You say, “It’s the order with the blue shirt and the leather shoes.” A great helper asks the right questions, finds the exact order, and follows store rules perfectly.

🥬 The Concept: Interactive tool-use agents are AIs that chat with you over several turns, ask clarifying questions, and then call tools (like company APIs) to get things done. They must turn your fuzzy words into exact, step-by-step actions that computers can run.

  • How it works: 1) You state a need (often vague). 2) The agent asks follow-ups. 3) The agent calls tools with precise arguments. 4) The agent checks results and continues until you’re satisfied. 5) The chat ends when everything is done.
  • Why it matters: Without this, AIs either guess wrong or can’t finish tasks that require back-and-forth, like travel changes or retail returns.

🍞 Anchor: Booking flights with seat changes needs follow-up questions (dates, airports, seat type) and exact tool calls. That’s what interactive tool-use agents aim to handle.
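The clarify-then-act loop above can be sketched as a toy agent that asks for missing arguments before committing to a tool call. The class, fields, and tool name here are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ToyAgent:
    """Minimal interactive agent: accumulates what the user reveals,
    asks a clarifying question if a required argument is missing,
    and only then emits an exact tool call."""
    known: dict = field(default_factory=dict)

    def step(self, user_message: dict) -> str:
        # 1) Absorb whatever the user revealed this turn.
        self.known.update(user_message)
        # 2) Required argument still missing? Ask instead of guessing.
        if "order_id" not in self.known:
            return "ask: which order do you mean?"
        # 3) All arguments known: call the tool with precise values.
        return f"tool: cancel_pending_order(order_id={self.known['order_id']})"

agent = ToyAgent()
print(agent.step({"intent": "cancel"}))       # clarifying question first
print(agent.step({"order_id": "#W6289991"}))  # then an exact tool call
```

The key design choice mirrored here is that the agent never fabricates an ID: it either asks or acts on exact information.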

The world before: Early tool-use work showed LLMs could call APIs, but most tasks were one-shot or non-interactive. Real people, however, talk in puzzle pieces. They forget IDs, mix details, or change minds. Meanwhile, tools are strict: they demand exact IDs and valid arguments. This mismatch—human fuzziness vs. machine precision—left a big gap for multi-turn, tool-using conversations.

The problem: Training these agents requires lots of example conversations where the agent both chats and uses tools correctly. Human-made data is expensive. Auto-generated data from LLMs is cheaper but often flawed: the tasks might be unsolvable, the checking might be wrong, and the dialogs are often too short or too simple, so the agent never learns the tricky parts.

Failed attempts: Many pipelines asked an LLM to invent conversations and then asked another LLM to grade them. But LLM graders can hallucinate, approve bad calls, or fail to catch missing steps. Also, LLMs tend to write easy tasks: few tool calls, few turns, and little ambiguity. That’s like training a basketball team only on layups—then wondering why they struggle in real games.

The gap: We needed a way to 1) guarantee tasks are solvable, 2) make conversations realistically messy and multi-turn, and 3) verify tool-use with certainty, not guesses. In other words, a system that grows data complexity while locking in correctness.

🍞 Hook: You know how a pilot uses a checklist so nothing important gets missed?

🥬 The Concept (Deterministic Checklist): A deterministic checklist is a set of exact goals every finished task must satisfy, used to verify success with no guessing.

  • How it works: 1) Define exact goals (cancel order X, return item Y). 2) Run the conversation. 3) Compare the tool results to the goals. 4) Mark correct if all goals are met and no extra, irrelevant actions occurred.
  • Why it matters: Without a deterministic checklist, verification can be fuzzy and let errors slip through.

🍞 Anchor: If the goal says “Cancel order #W6289991,” the agent must actually cancel that order—no excuses and no canceling the wrong one.
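The pass/fail rule a deterministic checklist implements is simple set logic: every goal happened, and nothing else did. A minimal sketch (goal encoding is an assumption):

```python
def checklist_passes(constraints: set, executed: set) -> bool:
    """Pass only if every required goal was met (subset check) and
    no extra, irrelevant action occurred (empty difference)."""
    return constraints.issubset(executed) and not (executed - constraints)

goals = {("cancel_order", "#W6289991")}
# Exactly the required action: pass.
assert checklist_passes(goals, {("cancel_order", "#W6289991")})
# Canceling the wrong order: fail.
assert not checklist_passes(goals, {("cancel_order", "#W4121151")})
# Right action plus an extra cancellation: fail (redundancy).
assert not checklist_passes(goals, goals | {("cancel_order", "#W1111111")})
```

Because the check is pure set comparison, no amount of persuasive wording in the dialogue can change the verdict.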

Real stakes: People want reliable AI helpers for customer support, travel, banking, and more. If agents guess, skip steps, or do extra actions, they can break policies, waste time, or even cause losses. Great training data—with real ambiguity plus exact verification—means agents you can trust in daily life.

🍞 Hook: When you do a school project, starting with clear requirements makes it easier to know if you did it right.

🥬 The Concept (τ-bench): τ-bench is a testbed where agents must talk with users and call tools across domains like Retail and Airline.

  • How it works: 1) Provide realistic environments and tools. 2) Present ambiguous needs. 3) Score success across multiple attempts for stability.
  • Why it matters: Without strong, realistic tests, we can’t tell if agents truly handle messy, real-world tasks.

🍞 Anchor: τ-bench is like a practice field that looks and feels like a real game, so coaches see if the team’s skills hold up.

02Core Idea

🍞 Hook: Think of building Lego models. If you start with a clear picture (the box art) and then ask a friend to only give you hints like “use the red 2x4 brick,” you’ll still end up with the same exact model if you follow the plan.

🥬 The Concept (CoVe): CoVe is a way to train interactive tool-use agents by starting from exact goals (constraints), turning them into fuzzy, human-like requests, and then checking the agent’s tool actions against the original exact goals.

  • How it works: 1) Sample exact, solvable goals from a sandbox database. 2) “Fuzzify” these goals into human-style hints (no IDs at first). 3) A user-simulator chats with the agent, revealing needs gradually. 4) After the chat, the original goals act as a strict checklist to verify success and spot extra, useless actions. 5) Keep only perfect conversations for supervised fine-tuning or use the checklist as a reward for reinforcement learning.
  • Why it matters: Without CoVe, automatically made training data can be too easy or wrong. CoVe makes it real and guarantees correctness.

🍞 Anchor: If the exact goal is “Return product 8310926033 from order #W2021911,” the fuzzy version might be “Return the chair from the order that had a table and a chair.” The checklist later confirms the agent truly returned the right item from the right order.

The “Aha!” in one sentence: Start from exact goals so solvability and verification are guaranteed, then add fuzziness to force real conversational reasoning, and finally grade with the exact goals to keep only flawless training data.

Three analogies:

  1. Treasure map: The X marks (constraints) are exact. You tell a friend only vague hints to make it interesting (fuzzification). After the hunt, the map decides if you truly found the treasure (verification).
  2. Cooking with a rubric: The final dish must taste a certain way (constraints). You get flexible instructions like “add spices to taste” (fuzzification). Judges use the rubric to score the outcome precisely (verification).
  3. Math homework: The teacher picks solvable problems (constraints). Word problems hide the numbers a bit (fuzzification). The answer key checks if you solved every part correctly (verification).

Before vs. after:

  • Before: Auto-generated dialogs were often too simple and checked by another LLM that could be wrong.
  • After: Dialogs are complex and realistic, but everything is still provably correct because we verify against the original exact constraints.

Why it works (intuition):

  • Start with truth: If goals come from a real database state, the task is guaranteed solvable.
  • Add realism: Fuzzification mimics how people really talk, forcing the agent to ask questions and use tools smartly.
  • End with certainty: A deterministic checklist can’t be fooled by fancy wording; it only cares if the right things happened.

Building blocks (each in Sandwich form):

🍞 Hook: You know how a coach picks drills that your team can actually do on today’s field? 🥬 The Concept (Constraint Sampling): Pick exact, solvable goals from the sandbox so every task can be completed without guessing.

  • How it works: 1) Read the database. 2) Choose goals like “cancel order A,” “return item B.” 3) Use exact IDs so they’re uniquely identified.
  • Why it matters: Without this, you might create impossible tasks. 🍞 Anchor: “Cancel order #W6289991” is a precise target pulled from the database.

🍞 Hook: People rarely remember IDs; they say “the order with the red laptop.” 🥬 The Concept (Constraint Fuzzification): Rewrite exact goals into human-style clues that still uniquely point to the same targets.

  • How it works: Replace IDs with descriptions (subset of items, last four digits of a card, default address, etc.) while ensuring uniqueness.
  • Why it matters: Without fuzziness, the agent just copies IDs; with it, the agent must reason and clarify. 🍞 Anchor: “The order with shoes and clothing” uniquely identifies order #W6289991.

🍞 Hook: Teachers grade with answer keys so scores are fair. 🥬 The Concept (Rule-based Verification with a Deterministic Checklist): Use the original constraints as a strict checklist to confirm every requirement was met and to penalize extra, irrelevant actions.

  • How it works: Parse tool calls, confirm outcomes match goals, and count redundancies.
  • Why it matters: Without rules, a grader could be fooled; with rules, correctness is crystal clear. 🍞 Anchor: If the goal is to return item 8310926033, only a tool call that returns that exact item counts.

🍞 Hook: Practicing with perfect examples is the fastest way to learn. 🥬 The Concept (SFT and RL with CoVe): Use only perfectly verified dialogs for supervised fine-tuning, or use the checklist score as a reward in reinforcement learning.

  • How it works: SFT keeps only score-1 dialogs; RL gets a reward equal to the checklist score.
  • Why it matters: Training signals stay clean and strong. 🍞 Anchor: The model copies steps from perfect dialogs in SFT; in RL, it explores and gets rewarded for satisfying all constraints.

03Methodology

High-level recipe: Inputs → [Sample exact constraints] → [Fuzzify into human-style clues] → [Simulate multi-turn chat + tool calls] → [Deterministic checklist verification] → Outputs (perfect SFT data and exact RL rewards)

Step A: Constraint Sampling

  • What happens: The system looks into a sandbox database (like a fake but realistic store or airline). It selects a set of goals that are definitely valid given the current records. For example, “Cancel order #W6289991,” “Return product 8310926033 in order #W2021911.”
  • Why this step exists: It guarantees solvability. If the data says the order exists and can be canceled, the task is real—not imaginary.
  • Example: Retail domain picks two constraints: (1) Cancel order #W6289991, (2) Return product 8310926033 from order #W2021911.
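Constraint sampling can be sketched as reading the current database state and emitting only goals that state makes valid. The schema and goal tuples below are illustrative assumptions:

```python
import random

# Tiny stand-in for the sandbox database (schema is illustrative only).
DB = {
    "orders": {
        "#W6289991": {"status": "pending", "items": ["shoes", "clothing"]},
        "#W2021911": {"status": "delivered", "items": ["table", "chair"]},
    }
}

def sample_constraints(db: dict, rng: random.Random) -> list:
    """Emit goals that are valid in the current state, so every
    sampled task is solvable by construction: only pending orders
    can be canceled, only delivered items can be returned."""
    goals = []
    for oid, order in db["orders"].items():
        if order["status"] == "pending":
            goals.append(("cancel_order", oid))
        elif order["status"] == "delivered":
            goals.append(("return_item", oid, rng.choice(order["items"])))
    return goals

print(sample_constraints(DB, random.Random(0)))
```

The point of gating on `status` is the solvability guarantee: a goal is never emitted unless the database says it can actually be completed.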

Step B: Constraint Fuzzification

  • What happens: The exact IDs are hidden behind uniquely identifying descriptions, so the conversation feels human. Order IDs become “the order with a red gaming laptop and other items” or “the order with shoes and clothing.” Item IDs become “the chair,” possibly with an attribute like color. Payment IDs might become “the Visa card ending in 1234.” User IDs might become an email.
  • Why this step exists: If you show IDs, the agent just copies them and calls the API. Fuzziness forces the agent to ask clarifying questions or to call lookup tools.
  • Example: “Return product 8310926033 in Order #W2021911” becomes “Return the chair from the order with a table and a chair.” Uniqueness checks ensure there’s only one such match.
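The uniqueness check is the crux of fuzzification: a clue only counts if it still points at exactly one record. A sketch under an assumed schema (describing an order by its item set):

```python
def fuzzify_order(orders: dict, target_id: str):
    """Describe an order by its item composition, but return the clue
    only if it uniquely identifies the target; otherwise signal that
    another attribute must be used instead."""
    items = frozenset(orders[target_id]["items"])
    matches = [oid for oid, o in orders.items()
               if frozenset(o["items"]) == items]
    if matches == [target_id]:
        return f"the order with {' and '.join(sorted(items))}"
    return None  # ambiguous clue: fall back to another attribute

orders = {
    "#W6289991": {"items": ["shoes", "clothing"]},
    "#W2021911": {"items": ["table", "chair"]},
}
assert fuzzify_order(orders, "#W2021911") == "the order with chair and table"
```

If two orders shared the same item set, the function returns `None` and the pipeline would try a different attribute (color, card digits, address), preserving the guarantee that the fuzzy task has exactly one correct answer.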

Step C: Multi-turn Trajectory Generation with a User Simulator

  • What happens: A user-simulator LLM pretends to be a real customer. It reveals needs gradually, responds naturally, and signals completion with a special token like ###STOP### when all needs are met. The agent asks clarifying questions and makes tool calls. The dialogue plus tool history is the trajectory.
  • Why this step exists: Real users rarely tell you everything at once. Training must reflect that to teach the agent robust, step-by-step reasoning.
  • Example: The simulator starts with “I need help with a recent order,” later adds “It’s the one with shoes and clothing,” and finally asks “Please cancel that order and also return the chair from another order.” Meanwhile, the agent uses tools like cancel_pending_order(order_id) and return_delivered_order_items(order_id, item_id).
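The gradual-reveal behavior can be sketched as a scripted simulator that surfaces one need per turn and emits the stop token only once everything is satisfied. The function shape and script are illustrative; the stop token is from the paper:

```python
STOP = "###STOP###"

def simulate_user(turn: int, script: list, goals_done: bool) -> str:
    """Toy user simulator: reveals scripted needs one turn at a time
    and emits the stop token once every need has been satisfied."""
    if goals_done:
        return STOP
    if turn < len(script):
        return script[turn]
    return "That's everything I remember."

script = [
    "I need help with a recent order.",
    "It's the one with shoes and clothing.",
    "Please cancel that order.",
]
turns = [simulate_user(i, script, goals_done=False) for i in range(3)]
turns.append(simulate_user(3, script, goals_done=True))
print(turns)
```

A real simulator is an LLM prompted to behave this way; the scripted version just makes the turn-by-turn reveal and the termination signal concrete.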

Step D: Rule-based Deterministic Verification

  • What happens: After the chat ends, a verifier reads the tool calls and outcomes, compares them to the original constraints, and computes a score. It checks both constraint satisfaction (did all required goals happen?) and redundancy (were there extra, unnecessary operations?).
  • Why this step exists: This removes grader hallucinations. Success is not about pretty words; it’s about whether the correct state changes actually occurred.
  • Example: If the constraints say cancel #W6289991 and cancel #W1999528 and return product 8310926033 in order #W2021911, then a trajectory that cancels #W6289991 and returns the right product but also cancels #W4121151 is marked as partially correct with redundancy, not perfect.
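That example scores as follows under a sketch verifier that reports the satisfied fraction plus a redundancy count (the exact score formula is an assumption; the source only says the score reflects satisfaction and redundancy):

```python
def verify(constraints: set, executed: set) -> dict:
    """Deterministic verifier: fraction of required goals satisfied,
    plus a count of redundant calls matching no constraint."""
    satisfied = constraints & executed
    redundant = executed - constraints
    return {
        "score": len(satisfied) / len(constraints),
        "redundant": len(redundant),
        "perfect": satisfied == constraints and not redundant,
    }

goals = {("cancel", "#W6289991"), ("cancel", "#W1999528"),
         ("return", "#W2021911", "8310926033")}
calls = {("cancel", "#W6289991"),
         ("return", "#W2021911", "8310926033"),
         ("cancel", "#W4121151")}  # extra, irrelevant cancellation
result = verify(goals, calls)     # 2 of 3 goals met, 1 redundant call
```

`result` here is partially correct (score 2/3) with one redundancy, so it is not perfect, exactly as in the worked example above.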

Step E: Post-training with SFT or RL

  • SFT: Keep only trajectories with a perfect score (all goals met, no redundancy). Train the student model to imitate these gold-standard steps.
  • RL: Let the student model interact and explore. When a trajectory finishes, use the verifier’s score as the reward signal to improve the policy.
  • Why this step exists: Clean SFT data prevents the model from learning mistakes. Precise RL rewards guide exploration without noise.
  • Example: The SFT dataset contains only fully correct, non-redundant airline and retail conversations. In RL, if the agent meets 2 of 3 goals and makes one extra call, its reward reflects that numerically.
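Both training modes can reuse the same verifier output. Below, the SFT filter keeps only perfect trajectories, and the RL reward is one plausible shaping (score minus a per-redundancy penalty); the penalty weight is an assumption, not the paper's formula:

```python
def sft_filter(trajectories: list) -> list:
    """Keep only trajectories the verifier marked perfect:
    all goals met, zero redundant calls."""
    return [t for t in trajectories if t["perfect"]]

def rl_reward(t: dict, redundancy_penalty: float = 0.1) -> float:
    """One plausible reward shape: checklist score minus a small
    penalty per redundant call (exact shaping is an assumption)."""
    return t["score"] - redundancy_penalty * t["redundant"]

batch = [
    {"score": 1.0, "redundant": 0, "perfect": True},    # kept for SFT
    {"score": 2 / 3, "redundant": 1, "perfect": False},  # RL reward only
]
assert len(sft_filter(batch)) == 1
assert abs(rl_reward(batch[1]) - (2 / 3 - 0.1)) < 1e-9
```

One verifier, two uses: a hard gate for imitation data and a graded signal for exploration.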

Secret Sauce (what makes it clever):

  • Start from constraints (truth) so tasks are solvable.
  • Add fuzziness (realism) so the agent must converse and reason.
  • End with deterministic verification (certainty) so the training data is provably correct.
  • Reuse the same verifier for both SFT filtering and RL rewards—one core engine powering two training modes.

Concrete mini-walkthrough (Retail):

  1. Sample constraints: {Cancel #W6289991; Return item 8310926033 from #W2021911}.
  2. Fuzzify: {“Cancel the order with shoes and clothing”; “Return the chair from the order with a table and a chair.”}
  3. Simulate chat: User simulator starts vague, agent asks for details or uses a lookup tool by item composition, finds exact orders, confirms with the user, executes cancellations/returns.
  4. Verify: Did the actual tool calls cancel #W6289991 and return item 8310926033 from #W2021911? Were there extra, irrelevant calls? Score it.
  5. Train: Keep only perfect cases for SFT; use scores as rewards for RL.

Concrete mini-walkthrough (Airline):

  1. Sample constraints: {Change seat to aisle on flight X; Update contact to email Y}.
  2. Fuzzify: {“I want a seat that’s easy to get up from” (aisle); “Use my main email (the one on my profile).”}
  3. Simulate chat: Agent confirms the flight, checks seat map, offers aisle choices, updates contact.
  4. Verify: Ensure the exact flight’s seat is aisle and contact email matches Y; penalize unrelated changes.
  5. Train: As above.

04Experiments & Results

The Test: The team evaluated on τ-bench’s Airline and Retail domains. These domains require back-and-forth conversations with tool calls under real policies. They used pass@k metrics: pass@1 to pass@4 measure the chance the agent succeeds in all k consecutive tries—so higher k means you’re not just lucky; you’re stable.

🍞 Hook: You know how getting an A once is great, but getting As four times in a row proves you really know it? 🥬 The Concept (pass@k): pass@k checks if the model can complete the task k times in a row, showing consistency.

  • How it works: Run the same task multiple times independently; count success only if all k runs succeed.
  • Why it matters: It filters out lucky wins and rewards steady performance. 🍞 Anchor: If an agent solves a return once (pass@1) but fails the next time, it won’t score on pass@2.
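In the sense used here (all k runs must succeed, not the code-generation "at least one of k" sense), pass@k is easy to compute:

```python
def pass_at_k(run_results: list, k: int) -> bool:
    """pass@k as used on this benchmark: the task counts as passed
    only if the first k independent runs all succeed."""
    return len(run_results) >= k and all(run_results[:k])

runs = [True, False, True, True]
assert pass_at_k(runs, 1)      # first run succeeded
assert not pass_at_k(runs, 2)  # second run failed, so no pass@2
```

This is why pass@4 is a much harsher bar than pass@1: a single flaky failure anywhere in the window zeroes it out.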

The Competition: CoVe-4B (4B parameters) was compared with similarly sized open-source models (3B–8B), mid-scale models (~30B), and very large ones (70B to 235B and proprietary). Baselines included Qwen3-4B-Instruct-2507, Simia variants, xLAM-2 series, GEM, and giants like Qwen3-235B and GPT models.

The Scoreboard with context:

  • Average pass@1: CoVe-4B reached 51.2% across Airline and Retail. That’s like getting an A- while other 4–8B students often got B- or C+.
  • By domain: Airline pass@1 was 43.0%; Retail pass@1 was 59.4%. Retail was consistently easier across models.
  • Versus base: CoVe-4B beat its base Qwen3-4B-Instruct-2507 by +18.6 points (32.6% to 51.2%). That’s a big leap for the same-sized student.
  • Versus bigger models: CoVe-4B outperformed some ~30B models and came close to a 70B model (51.2% vs 51.5% average pass@1). It narrowed the gap with massive 235B and proprietary models to under 5 points in average pass@1.
  • Stability (higher k): CoVe held up better than many peers as k increased, showing that its wins weren’t just lucky one-offs.

Ablations and findings:

  • Data quality beats size: CoVe-5K SFT data slightly outperformed a 90K noisier dataset (44.7% vs 44.3% pass@1). Scaling CoVe to 12K pushed performance to 51.2%. Clean, verified dialogs trump raw volume.
  • Training paradigm: Pure SFT using CoVe data achieved 51.2%. Pure RL lifted the base model to 40.7%. But SFT+RL dropped to 46.9%, because the online RL stage used a single, weaker user simulator and the policy overfit to its quirks.
  • User simulator choice matters: Gemini-3-Pro produced higher-yield, correctly terminated dialogs (74% success) than a weaker simulator, highlighting the importance of a capable, well-prompted simulator.

Surprises:

  • A carefully verified 4B student can rival or beat much larger models on these domains. That’s rare and exciting—it shows that great data curation can substitute for sheer parameter count.
  • Adding RL after strong SFT hurt in this setup, not because RL is bad, but because the online environment was too narrow, suggesting that better simulators or multi-simulator mixes could flip this result.

Bottom line: CoVe’s constraint-to-fuzziness-to-checklist pipeline produces training data that makes small models act much bigger, especially in complex, multi-turn, tool-using conversations.

05Discussion & Limitations

Limitations:

  • RL environment bottleneck: Using a single, weaker user simulator during online RL caused overfitting and degraded SFT+RL performance.
  • Domain coverage: Results are shown for Airline and Retail; broader domains (like Telecom) and other benchmarks remain to be tested.
  • Simulator timing: Some simulators end conversations too early or misread clarifications, reducing usable data.
  • Policy equivalences: Verifier rules must encode domain policies (e.g., what counts as a valid cancellation). If policies change, rules must be updated.

Required resources:

  • A sandbox database matching the target domain’s style and rules.
  • Tool (API) definitions and access for the agent.
  • A verifier implementation that parses tool calls and outcomes.
  • One or more capable user simulators; stronger simulators improve yields and realism.
  • Compute for SFT and (optionally) RL rollouts.

When not to use CoVe:

  • If you cannot define clear, checkable constraints (e.g., tasks judged purely by subjective style), deterministic verification won’t fit.
  • If you lack any tool APIs or structured environment, the checklist can’t verify concrete state changes.
  • If your domain changes policies too frequently to maintain verifier rules, upkeep cost may outweigh benefits.

Open questions:

  • How to best mix multiple simulators (or train a dedicated one) to maximize diversity without introducing noise?
  • Can we automatically learn or adapt verifier rules as domain policies evolve?
  • What’s the optimal balance of SFT and RL when simulators are strong and varied?
  • How well does CoVe transfer to new domains (e.g., Telecom) and broader benchmarks like BFCL?
  • Can we incorporate uncertainty handling (e.g., partial information) while keeping deterministic verification for the parts that can be checked?

06Conclusion & Future Work

Three-sentence summary: CoVe trains interactive tool-use agents by starting from exact, solvable goals, fuzzifying them into human-like requests, and then verifying the agent’s tool actions with the original goals as a checklist. This creates complex but guaranteed-correct training dialogs for SFT and gives precise rewards for RL. The result is small models that perform like much larger ones on tough, multi-turn tasks.

Main achievement: Proving that constraint-guided, deterministically verified data synthesis can reliably produce high-quality, complex training trajectories that dramatically lift small-model performance on τ-bench.

Future directions: Strengthen user simulators (or train a specialized one), expand to more domains (e.g., Telecom), refine verifier rules for evolving policies, and revisit SFT+RL with richer simulators and environments to surpass SFT alone.

Why remember this: CoVe shows that data you can trust—complex, realistic, and provably correct—can matter more than raw model size. It offers a practical, open-source recipe for building reliable conversational agents that don’t just talk—they get real work done with tools, consistently and correctly.

Practical Applications

  • Build retail support bots that accurately process cancellations, returns, and exchanges using verified training dialogs.
  • Train airline assistants to handle seat changes, refunds, and contact updates with precise, policy-safe tool calls.
  • Create banking or billing agents that confirm identities and update records using fuzzified requests and strict verification.
  • Develop IT helpdesk bots that gather device details conversationally and run exact diagnostic or reset tools.
  • Teach healthcare admin agents to schedule appointments and verify insurance details with deterministic checks.
  • Improve government service portals where agents must follow strict rules while clarifying citizen requests.
  • Upgrade enterprise workflow assistants that coordinate between multiple internal APIs with zero redundant actions.
  • Prototype customer journey agents that can ask for missing info and still meet exact backend requirements.
  • Use CoVe’s verifier as a reward function to safely explore new tool-use strategies in RL without noisy grading.
  • Generate diverse, high-quality training data for new domains (e.g., telecom) by defining domain-specific constraints and fuzzification rules.
#constraint-guided verification#multi-turn tool use#user simulator#constraint fuzzification#deterministic checklist#rule-based verification#supervised fine-tuning#reinforcement learning#τ-bench#pass@k#agentic AI#tool-augmented LLMs#data synthesis#interactive agents#API grounding