AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
Key Summary
- AgenticPay is a safe playground where AI agents practice buying and selling by talking, not just by typing numbers.
- It lets many buyers and sellers chat over multiple rounds, each keeping secret limits on how much they'll pay or accept.
- A special parser reads the chat and turns it into actions like price offers and deal acceptance, so results can be scored fairly.
- Scores check whether a deal was possible, how balanced the price was for both sides, and how quickly the agents agreed.
- Across 111 tasks and 10 real-life scenarios, top proprietary models made faster, more balanced deals than smaller open models.
- All models did better as more buyers and sellers were added, because more choices made good matches easier to find.
- Sellers generally did better than buyers, showing a consistent role imbalance in current AI negotiation behavior.
- Open models often got very close to a deal but failed to make the final tiny concession before timing out.
- AgenticPay shows that great-sounding language isn't enough; long-horizon planning and strategy really matter.
- This benchmark gives researchers a common way to compare, improve, and safely deploy negotiation AIs in markets.
Why This Research Matters
Lots of everyday deals, from booking a vacation rental to choosing a software plan, are shaped by conversation as much as by numbers. AgenticPay helps us build AI that can negotiate fairly and clearly, not just talk smoothly. By revealing where models struggle, like making the final small concession, it guides improvements that lead to better outcomes for people. The benchmark's focus on fairness and speed also encourages designs that save time and reduce frustration. Over time, this can help families find better prices, small businesses get good supplier terms, and marketplaces stay competitive and transparent. With careful guardrails, these systems can become helpful assistants that advocate responsibly for users.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a school fair where kids trade stickers. Some kids know the rare ones they really want and the lowest number of stickers they would accept for their own rare sticker, but they keep those numbers secret. They talk, make offers, and try to meet in the middle.
The Concept (Multi-agent systems): What it is: A multi-agent system is a team of computer programs (agents) that each make decisions and interact with one another. How it works: 1) Each agent has a role (buyer or seller), 2) Each observes the world and its own private info, 3) Each decides what to do next (like making an offer), 4) The team's combined choices create outcomes (a deal or no deal). Why it matters: Without multiple agents, you can't study real negotiations, because bargaining needs at least two sides talking and reacting.
Anchor: Think of a soccer game: each player (agent) has their own job, and the game only makes sense when everyone plays together.
Hook: You know how you convince a friend to trade you half of their sandwich by explaining why your snack is tastier? That's negotiation with words, not just numbers.
The Concept (Large Language Models, LLMs): What it is: LLMs are computer programs that read and write human-like language. How it works: 1) They read what's already been said, 2) Predict the next useful words, 3) Keep track of goals like getting a good price, 4) Adjust their messages each round. Why it matters: If agents can't understand and produce language well, they can't bargain clearly, explain reasons, or find fair middle points.
Anchor: When you ask a chatbot for homework help and it replies in clear sentences, that's an LLM using language to help you.
Hook: Picture two kids swapping baseball cards. They don't flash calculators; they talk: "It's signed!" "It's rare!" "Meet me at 5 cards?"
The Concept (Language-mediated negotiation): What it is: It's bargaining done by talking in natural language instead of just posting numbers. How it works: 1) Each side states what they want, 2) They make offers and counteroffers, 3) They share reasons and constraints, 4) They agree or walk away. Why it matters: Real-life deals depend on explanations, trust, and back-and-forth, things pure numbers can't capture.
Anchor: When you haggle at a yard sale and say, "It has a stain, can we do $5?", that's language-mediated negotiation.
Hook: Sometimes conversations feel a bit like chance; you can't predict exactly what someone will say next.
The Concept (Stochastic language games): What it is: A language game where what people (or agents) say and do can vary unpredictably across rounds. How it works: 1) Agents take turns talking, 2) The next message depends on history and hidden info, 3) Randomness means the same setup can lead to different chats, 4) The game ends with a deal or timeout. Why it matters: Without modeling uncertainty, we'd over-simplify real talk and miss how slight changes lead to different outcomes.
Anchor: Two identical chess openings can lead to different middlegames; likewise, similar negotiations can diverge.
The world before this paper: Many AI tests checked single-step answers, simple math bids, or very short negotiations. They rarely had agents with hidden minimum/maximum prices or many buyers and sellers at once. That's like grading a debate team on one-liners instead of a full debate.
The problem: We lacked a principled, language-first benchmark to test whether LLM agents can truly negotiate across different markets, roles, and products over many rounds.
Failed attempts: Prior work used numeric auctions (easy to grade but missing the richness of talk), toy bargaining (too short, not general), or single-agent tests (no real opposition). They didn't capture real-world features like private reservation values, competition, or product differences.
Hook: You know how a secret spending limit keeps you from overspending at the mall?
The Concept (Private reservation values): What it is: A buyer's top price and a seller's bottom price that they keep secret. How it works: 1) Each agent gets a hidden limit, 2) They negotiate without revealing it, 3) Offers should stay within the possible overlap, 4) A deal is only valid if it fits both limits. Why it matters: Without secrets, bargaining becomes trivial; with secrets, strategy and careful wording matter.
Anchor: If you won't pay more than $20 and the seller won't take less than $15, the sweet spot is somewhere between $15 and $20.
The gap: We needed a scalable, language-grounded testbed where agents with secrets could negotiate over many rounds, across many buyers/sellers/products, with clear scores for fairness and speed.
Real stakes: Online shopping, contractor bids, travel bookings, and business purchases all use talk, trade-offs, and timing. A better benchmark means safer, fairer AI negotiators: for example, helping a family find a fair rental price or a small shop get good supplier terms.
02 Core Idea
Hook: Imagine organizing a giant swap meet where everyone talks through their deals, and a fair referee listens to every chat, writes down the actual offers, and scores how fair and fast each deal was.
The Concept (Dialogue-based action extraction): What it is: Turning free-form chat into structured actions like "offer $120" or "accept." How it works: 1) Agents speak in normal sentences, 2) A parser finds the price tags in the exact format, 3) It records actions each round, 4) It checks if both sides match on a price to finalize the deal. Why it matters: Without extracting actions from words, we couldn't grade who offered what, or decide when a deal really happened.
Anchor: It's like a teacher listening to a debate and tallying each promise into a checklist.
The "Aha!" moment in one sentence: Treat negotiation as a language game with hidden limits, then grade the talks by how possible, fair, and fast the final deal is, across markets of all sizes.
Three analogies:
- Board game night: Everyone plays by speaking moves; a scribe writes down the exact moves and declares when a player wins. AgenticPay is the table, the scribe, and the scoreboard.
- Farmers' market: Many buyers and many sellers chat about price and quality. AgenticPay is a safe copy of that market where we can test different AI "shoppers" and "vendors."
- School science fair: Each team presents (negotiates) over multiple rounds; judges (metrics) grade fairness (balanced price), feasibility (did a deal make sense?), and speed (how quickly they settled).
Before vs. after:
- Before: Negotiation tests were small, number-only, or short. Hard to know if AI could truly talk through complex deals.
- After: We have 111 tasks in 10 realistic scenarios, from one-on-one to many-to-many. We can compare AIs fairly and see where they stumble (like last-mile concessions).
Hook: You know how a ruler helps you measure a desk so everyone agrees on the size?
The Concept (Economic interaction metrics): What it is: Measurements that tell us if a deal was valid, fair, and efficient. How it works: 1) Check feasibility (is the price between the buyer's max and the seller's min?), 2) Check balance (did both sides get value?), 3) Check speed (fewer rounds is better), 4) Combine into clear scores. Why it matters: Without shared measurements, we'd argue forever about who negotiated better.
Anchor: It's like a report card showing not just the grade, but also how fast and fairly you worked.
Why it works (intuition, no math):
- Negotiation lives in language, not just numbers, so grading must listen to the chat but still pull out clean actions (offers, acceptances).
- Secrets (reservation values) make strategy real; the best agents search for a win–win zone.
- Clear, symmetric scoring encourages fair splits, not just winning at all costs, and rewards deals made sooner.
Building blocks:
- Multi-agent roles (buyers, sellers), each with private limits.
- A clean environment that gives public info (like product details) but hides private prices.
- A strict but simple price format that the parser can always read.
- A scoring system that rewards feasibility, fairness, and fewer rounds.
Hook: Picture a playground where the lines are drawn for soccer, the ball is ready, and a scoreboard tracks goals.
The Concept (Many-to-many market): What it is: A setup with multiple buyers and multiple sellers, possibly across multiple products. How it works: 1) Everyone can talk to more than one partner, 2) Alternatives create competition, 3) Agents can choose, switch, or commit, 4) Better matches appear more often. Why it matters: Real markets aren't just one-on-one; more options can lead to better, faster deals.
Anchor: At a busy swap meet, you can walk to the next table if a price feels off, and that pressure speeds up fair deals.
Hook: If there are many snack carts at recess, prices tend to be more reasonable because you have choices.
The Concept (Market liquidity): What it is: How easy it is to find a trading partner you like. How it works: 1) More participants, 2) More product options, 3) More chances to match preferences, 4) Faster, better deals. Why it matters: Low liquidity traps you in bad or slow negotiations; high liquidity gives you alternatives and leverage.
Anchor: If three friends want to trade cards and only one has what you need, it's slow; if ten friends have options, you'll finish fast with a better trade.
03 Methodology
At a high level: Input (scenario, product, secret limits) → Multi-round chat (offers and reasons) → Parser extracts actions → Scoring system decides deal quality and speed → Output (scores for buyer, seller, and overall).
Step-by-step details:
- Set up the world
- What happens: The environment picks a scenario (like used phone, rental, SaaS), shows product info to both sides, and secretly gives the buyer a maximum price and the seller a minimum price.
- Why this exists: Without public product info, talks are vague; without secret limits, strategy disappears.
- Example: In a used iPhone task, both see "iPhone 14 Pro, 87% battery," but only the buyer knows their max is $480.
Hook: Imagine two kids who each know their private budget but both can see the same toy and its features.
The Concept (Bargaining zone): What it is: The overlap between the buyer's top price and the seller's bottom price. How it works: 1) Compute buyer max minus seller min, 2) If positive, a deal is possible, 3) If zero or negative, no fair price exists, 4) Negotiation should search this overlap. Why it matters: Without knowing that a zone exists, you might waste time chasing impossible deals.
Anchor: If you'll pay up to $20 and the seller needs at least $15, the bargaining zone runs from $15 to $20.
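The overlap check described above is simple enough to sketch in code. This is an illustrative snippet only, not the benchmark's actual implementation; the function and variable names (`bargaining_zone`, `buyer_max`, `seller_min`) are my own.

```python
# Sketch of the bargaining-zone and feasibility checks. All names here are
# illustrative assumptions, not taken from the benchmark's code.

def bargaining_zone(buyer_max: float, seller_min: float):
    """Return the (low, high) price overlap, or None when no deal is possible."""
    if buyer_max >= seller_min:
        return (seller_min, buyer_max)
    return None

def is_feasible(price: float, buyer_max: float, seller_min: float) -> bool:
    """A deal is valid only if the agreed price fits both private limits."""
    return seller_min <= price <= buyer_max

print(bargaining_zone(20, 15))  # (15, 20): a deal is possible
print(bargaining_zone(15, 20))  # None: buyer's max sits below seller's min
print(is_feasible(17, 20, 15))  # True
```

The same check is what the scoring system later uses to decide whether an agreed price was ever possible at all.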
- Talk in turns
- What happens: Buyer and seller alternate messages. Each message must include exactly one price in a strict tag like "### BUYERPRICE($130) ###."
- Why this exists: The strict tag lets the parser reliably read prices from natural language.
- Example: "I noticed summer discounts. How about ### BUYERPRICE($120) ###?"
- Parse the talk into actions
- What happens: A parser scans the message for the price tag, records each offer, and checks if both sides match on a price.
- Why this exists: Free-form chat is messy; parsing turns it into clear steps (offer, counter, accept).
- Example: It reads $140 from the seller, then $133 from both sides: deal!
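A minimal parser for this step can be sketched with a regular expression. This is a hedged reconstruction: only the tag shape (`### BUYERPRICE($130) ###`) comes from the text above, so the regex details and the deal-matching rule are assumptions.

```python
import re

# Illustrative dialogue-to-action extraction. Only the tag format is given in
# the text; this regex and the matching rule are assumptions.
TAG = re.compile(r"###\s*(BUYER|SELLER)PRICE\(\$(\d+(?:\.\d+)?)\)\s*###")

def extract_offer(message: str):
    """Return (role, price) from one chat message, or None if no valid tag."""
    m = TAG.search(message)
    if m is None:
        return None
    return m.group(1).lower(), float(m.group(2))

def deal_reached(last_buyer_price, last_seller_price) -> bool:
    """A deal finalizes when both sides' latest offers name the same price."""
    return (last_buyer_price is not None
            and last_buyer_price == last_seller_price)

msg = "I noticed summer discounts. How about ### BUYERPRICE($120) ###?"
print(extract_offer(msg))          # ('buyer', 120.0)
print(deal_reached(133.0, 133.0))  # True
```

Keeping the tag strict is the design choice that makes this reliable: free-form text stays natural, while the one machine-readable part is unambiguous.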
- Stop conditions
- What happens: The conversation ends if both sides propose the same price, if they run out of allowed rounds, or if someone makes invalid offers (like outside the acceptable bounds).
- Why this exists: Without firm stop rules, a chat might go on forever or accept unfair prices.
- Example: If the buyer's max is $480 but the seller's minimum is $520, any overlap is impossible, so talks should end with no deal and a penalty.
Hook: When you split a bill with friends, you try to be fair and fast, not just finish somehow.
The Concept (GlobalScore, BuyerScore, SellerScore): What it is: Three grades: overall fairness/efficiency (Global) and each side's personal benefit (Buyer, Seller). How it works: 1) Check if the deal is within the bargaining zone (feasible), 2) Measure how close the price is to the middle (balanced welfare), 3) Reward fewer rounds (efficiency), 4) Apply a small bonus for finishing and a penalty for failing. Why it matters: Without balanced grading, agents might game the system or drag talks out.
Anchor: It's like a report card with an overall grade plus personal scores for each teammate.
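The four ingredients above (feasibility, balance, speed, completion) can be combined into a toy score. To be clear, the real GlobalScore formula is not spelled out in this summary, so the 70/30 weighting, the zero score for failures, and every name below are illustrative assumptions.

```python
# Toy scoring sketch. The actual GlobalScore formula is not given in the text;
# the weights, the failure penalty, and these names are all assumptions.

def toy_global_score(price: float, buyer_max: float, seller_min: float,
                     rounds: int, max_rounds: int, deal_done: bool) -> float:
    if not deal_done or not (seller_min <= price <= buyer_max):
        return 0.0  # assumed penalty: failed or infeasible deals score zero
    mid = (buyer_max + seller_min) / 2
    half_zone = (buyer_max - seller_min) / 2
    # Balance: 1.0 at the exact midpoint of the zone, 0.0 at either edge.
    balance = 1.0 if half_zone == 0 else 1.0 - abs(price - mid) / half_zone
    # Efficiency: using fewer of the allowed rounds scores higher.
    speed = 1.0 - rounds / max_rounds
    return 100 * (0.7 * balance + 0.3 * speed)  # assumed 70/30 weighting

# A midpoint deal closed in 4 of 20 rounds scores near the top:
print(toy_global_score(17.5, 20, 15, rounds=4, max_rounds=20, deal_done=True))
```

Even this toy version shows the incentive structure: a lopsided price or a dragged-out chat lowers the grade, and no deal at all scores worst.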
- Many types of markets
- What happens: Tasks scale from 1 buyer–1 seller to many buyers–many sellers, and from single product to multiple products. The system supports parallel talks (multiple chats at once) or sequential talks (decide where to focus next).
- Why this exists: Real life has competition and choices; testing only one-on-one misses the hard parts.
- Example: In a many-to-many SaaS market, one buyer can compare offers from multiple sellers while each seller also courts multiple buyers.
Hook: You can text several friends at once (parallel) or plan one hangout at a time (sequential).
The Concept (Parallel vs. sequential negotiation modes): What it is: Two ways to schedule multiple talks. How it works: 1) Sequential: finish or pause one, then switch, 2) Parallel: keep several chats going, 3) Choose when to commit, 4) Manage attention and memory. Why it matters: Without planning the mode, an agent might lose track or miss better deals.
Anchor: Doing homework subject-by-subject (sequential) vs. juggling quick tasks across subjects (parallel).
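The difference between the two modes can be shown as turn orderings. This toy scheduler is my own sketch; the benchmark's real scheduling logic is not described here.

```python
from collections import deque

# Toy illustration of sequential vs. parallel chat scheduling. The benchmark's
# actual scheduler is not specified in the text, so this logic is assumed.

def sequential_turns(chats):
    """Finish each negotiation before starting the next."""
    order = []
    for name, n_turns in chats:
        order.extend([name] * n_turns)
    return order

def parallel_turns(chats):
    """Round-robin: one turn per chat until every chat is done."""
    queue = deque(chats)
    order = []
    while queue:
        name, n = queue.popleft()
        order.append(name)
        if n > 1:
            queue.append((name, n - 1))
    return order

print(sequential_turns([("A", 2), ("B", 2)]))  # ['A', 'A', 'B', 'B']
print(parallel_turns([("A", 2), ("B", 2)]))    # ['A', 'B', 'A', 'B']
```

The interleaved ordering is exactly what stresses weaker models: in parallel mode an agent must carry several partial negotiation states at once instead of one.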
- Consistent prompts and roles
- What happens: Both buyer and seller get structured prompts: product info, environment notes, a must-use price format, and secret limits not to reveal.
- Why this exists: Standard prompts make comparisons fair across different models.
- Example: The seller's prompt says, "Never reveal your minimum acceptable price."
- Unified inference protocol
- What happens: All models use the same decoding settings (like temperature 0) and the same max rounds and token limits.
- Why this exists: If rules differ, results wouldnāt be comparable.
- Example: Every model gets 20 turns max and must include exactly one price per turn.
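Such a protocol could be captured as a small shared config plus a per-turn check. Only temperature 0, the 20-turn cap, and the one-price-per-turn rule come from the text; the token limit and every name below are assumptions for illustration.

```python
# Sketch of a unified inference protocol as a shared config. Temperature 0,
# 20 max turns, and one price per turn come from the text; the rest (names,
# token limit) are illustrative assumptions.

PROTOCOL = {
    "temperature": 0.0,    # deterministic decoding for comparability
    "max_rounds": 20,      # same turn budget for every model
    "max_tokens": 512,     # assumed per-message token limit
    "prices_per_turn": 1,  # exactly one price tag required per message
}

def turn_is_valid(message: str, turn_index: int) -> bool:
    """Check one turn against the shared protocol (illustrative only)."""
    within_budget = turn_index < PROTOCOL["max_rounds"]
    n_tags = message.count("PRICE($")  # crude tag count for the sketch
    return within_budget and n_tags == PROTOCOL["prices_per_turn"]

print(turn_is_valid("How about ### BUYERPRICE($120) ###?", turn_index=3))  # True
print(turn_is_valid("Let me think about it.", turn_index=3))               # False
```

Pinning every model to one config like this is what makes the leaderboard comparisons meaningful: score gaps reflect strategy, not decoding settings.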
The secret sauce:
- Language-to-action grounding: Make agents talk naturally, but always emit a clean price tag that a parser can trust.
- Symmetric fairness scoring: Reward deals that sit in the sweet spot for both sides, not just "win-big/lose-big."
- Scalable market designs: From simple one-on-one to full markets with many participants and products, all under the same playbook.
04 Experiments & Results
The test: Researchers ran 111 tasks across 10 real-world scenarios (used phone, used car, vacation rental, website dev, photography, home renovation, SaaS, raw materials, luxury watch, business acquisition). They measured: overall fairness and speed (GlobalScore), each sideās personal benefit (BuyerScore, SellerScore), whether deals completed, how often talks timed out, and how many rounds it usually took.
The competition: Five models stood in as both buyers and sellers under identical rules. Three were strong proprietary models (Claude Opus 4.5, GPT-5.2, Gemini-3-Flash) and two were smaller open models (Qwen3-14B, Llama-3.1-8B).
The scoreboard (with context):
- Claude Opus 4.5 led with a GlobalScore around 86.9 and a 100% deal rate, like getting an A in a tough class and never missing an assignment.
- GPT-5.2 scored ~81.7 with 100% deals, a strong A-/B+ with perfect attendance.
- Gemini-3-Flash scored ~82.2 and also closed deals reliably, another A-range performance.
- Qwen3-14B scored ~63.9 with about 20.7% timeouts, more like a C+ where one in five talks ran out of time.
- Llama-3.1-8B scored ~32.5 with about half the talks timing out, like struggling to finish most assignments.
Speed vs. skill: Better models finished faster (3.7–4.8 rounds on average), while weaker models took many turns (up to ~15) or gave up. This shows that strong language plus good strategy means finding the fair price quickly.
Role imbalance: Everyone was better at selling than buying. Even top models had higher SellerScores than BuyerScores. This hints that current training may favor persuasive selling language more than careful buying tactics.
Scenarios matter: Financial asset deals (luxury watch, business acquisition) were hardest, especially for mid-tier models. These require careful risk reasoning and market awareness, which stretched the models.
More players helped: Surprisingly, adding more buyers and sellers often improved outcomes. With more choices (higher market liquidity), agents found better matches and made reasonable offers sooner, like how more snack carts keep prices fair.
Near-miss failures: Open models often got within a few dollars of agreement but timed out without making the final tiny concession. This shows a "last mile" problem: not language understanding, but strategic finishing.
Parallel vs. sequential: Top models did well in both. Some open models slightly improved in parallel mode but risked more rule mistakes (like price overflows). That suggests stronger models manage multiple threads of thought better.
Surprising findings:
- Complexity can help: Many-to-many markets nudged agents toward fairer, faster deals thanks to alternatives.
- Persona effects: An aggressive seller could produce balanced outcomes with patient buyers but lopsided ones with rushed buyers. Buyer styles like "Busy Professional" tended to concede too early and scored worse.
- Perfect language isn't enough: The biggest gaps were about planning, patience, and making the final compromise at the right time.
05 Discussion & Limitations
Limitations:
- Long-horizon strategy is hard: Weaker models struggled to stay focused across many turns, stalling right before agreement.
- Buyer weakness: All models negotiated better as sellers, exposing a role bias that needs attention to avoid unfair markets.
- Tough domains: Financial assets stressed models' risk and value judgment, causing big score drops in mid-tier systems.
- Parser dependence: The system needs a strict price tag to extract offers. Real-world chats can be messier, so robust extraction beyond fixed tags is a next step.
Required resources:
- Consistent prompts and decoding settings to ensure fair comparisons.
- Enough compute to run multiple multi-round chats (especially for open models on GPUs).
- A library/runtime (like vLLM or SGLang) or cloud APIs to host the models.
When NOT to use:
- One-shot pricing with no talk (a posted-price store) doesnāt need a language negotiation benchmark.
- High-stakes, real-money deployments without human oversight, especially where vulnerable users might be pressured, until safety audits and guardrails are in place.
- Domains where offers include many complex, non-price terms (warranties, delivery windows, penalties) not modeled yet.
Open questions:
- Can we train buyers to be as strong as sellers, reducing role asymmetry?
- How do we best teach the "last-mile concession" so near-miss failures drop sharply?
- Can we generalize beyond fixed tags to robustly parse free-form offers without losing reliability?
- How can we incorporate richer contracts (bundles, delivery terms) while keeping evaluation fair and simple?
- What safety rules and transparency tools best protect human users when AI negotiators go live?
06 Conclusion & Future Work
Three-sentence summary: AgenticPay is a large, language-first playground where AI buyers and sellers with secret limits negotiate across 111 tasks and 10 real-life scenarios. It turns chats into clean actions, then scores feasibility, fairness, and speed to reveal what today's models can and cannot do. Results show strong proprietary models make fast, balanced deals, while smaller open models often stall right before agreement; strategy and patience matter as much as fluent words.
Main achievement: A scalable benchmark that grounds talk in measurable actions, supports markets from one-on-one to many-to-many, and provides clear, comparable scores that highlight real negotiation skills, not just pretty language.
Future directions: Strengthen buyer strategies, teach last-mile concessions, expand parsing beyond fixed tags, add richer contract terms and risk reasoning, and build guardrails for safe, fair deployment alongside people.
Why remember this: Because real-world deals are conversations, not just numbers. AgenticPay shows how to fairly test conversation-powered agents at market scale, so tomorrow's AI negotiators can be not only eloquent but also fair, fast, and truly helpful.
Practical Applications
- Train buyer-side AI assistants to seek fair prices without overpaying in online marketplaces.
- Coach seller-side agents to close deals faster while keeping offers within valid bounds.
- Support procurement teams comparing multiple vendors in parallel and summarizing best options.
- Help vacation renters or hosts negotiate reasonable total prices including fees.
- Assist small shops in finding better raw-material deals through many-to-many negotiations.
- Benchmark and compare new LLMs for negotiation readiness before deploying them to users.
- Stress-test agent strategies on hard domains like financial assets to improve risk reasoning.
- Develop safety guardrails that prevent revealing secret reservation prices or making unfair offers.
- Design teaching tools that show students how negotiation works with fair, fast outcomes.
- Prototype marketplace features that detect near-miss deals and suggest tiny final concessions.