
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Intermediate
Yongtong Wu, Shaoyuan Chen, Yinmin Zhong et al. Ā· 2/25/2026
arXiv

Key Summary

  • Agent-style LLMs chat with tools over many short turns, so most tokens are repeats and the system spends more time fetching old memories (KV-Cache) than computing new answers.
  • Today’s common setup makes prefill GPUs read huge KV-Cache files from storage, choking their single storage NIC, while decode GPUs’ storage NICs sit mostly idle.
  • DualPath adds a second road for loading KV-Cache: storage → decode GPU → fast compute network → prefill GPU, so all storage NICs help together.
  • A careful traffic manager sends all non-urgent KV moves over low-priority lanes and keeps model communications on high-priority lanes, so generation stays snappy.
  • A smart scheduler balances who reads from where and who computes what, so neither GPUs nor NICs get overloaded.
  • Layerwise prefill streams only one layer of KV at a time, fitting big jobs into GPU memory and overlapping data moves with compute.
  • Across real agent traces and multiple models, DualPath sped up offline throughput by up to 1.87Ɨ and boosted online serving capacity by about 1.96Ɨ without breaking latency SLOs.
  • The gains get bigger with longer contexts and many concurrent agents because storage I/O, not compute, dominates these workloads.
  • DualPath scales to over a thousand GPUs with near-linear performance while keeping the scheduler lightweight.
  • It works with standard datacenter fabrics (InfiniBand or RoCE) using QoS to safely share the fast compute network.

Why This Research Matters

Agent-style AI is becoming common—from coding assistants that run tests every few seconds to support bots that browse and summarize pages over long chats. DualPath turns a hidden storage choke point into a coordinated, two-lane system so these agents respond sooner and handle more users per cluster. That means lower cost per query and better use of the hardware you already own, instead of just buying more GPUs. It also reduces energy waste by keeping GPUs productively busy rather than idling while waiting on storage. Because DualPath protects latency-critical communication with network QoS, users still experience snappy token streams. Finally, its design works on today’s datacenter fabrics (InfiniBand or RoCE), making it practical to deploy without exotic hardware.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine a classroom where most kids already know last week’s lesson. Each day, they only learn a tiny new piece, but the teacher still has to flip through the whole notebook to remember where everyone left off.

🄬 The Concept (Agentic LLMs): Agentic LLMs are AI systems that solve tasks across many short back-and-forth turns, often calling tools like a browser or Python during each step.

  • How it works: 1) The model keeps a long memory of the conversation. 2) Each turn adds a small chunk. 3) The model reuses almost all of the old memory to decide the next step.
  • Why it matters: Without reusing memory, it would constantly recompute everything, becoming slow and wasteful.

šŸž Anchor: A coding assistant that runs tests, reads error logs, tries a fix, and repeats hundreds of times—most of its context stays the same each turn.

šŸž Hook: You know how a bookmark helps you quickly find your spot in a book? A cache is like a bookmark for computers.

🄬 The Concept (Cache, then KV-Cache):

  • What it is: A cache stores useful results so we don’t redo the work. In LLMs, the KV-Cache stores attention keys and values for already-seen tokens.
  • How it works: 1) When the model reads tokens, it computes keys (K) and values (V). 2) It saves them in the KV-Cache. 3) Next turn, it loads these instead of recomputing. 4) It only computes for the few new tokens.
  • Why it matters: Without KV-Cache, multi-turn agents would crawl because they’d repeat old work every turn.

šŸž Anchor: If the story has 30,000 old tokens and you add 300 new ones, the model can reload the 30,000 from the KV-Cache and only compute for 300.

šŸž Hook: Think of cooking: prep (washing, chopping) is different from cooking (stirring, seasoning). Doing both at once in one tiny sink creates a mess.

🄬 The Concept (Prefill vs. Decode and PD Disaggregation):

  • What it is: Prefill prepares the model with the prompt; decode generates tokens one by one. PD disaggregation runs prefill and decode on different GPUs.
  • How it works: 1) Prefill GPU batches many prompts, loads their KV-Cache, and prepares state. 2) It hands the prepared state to the decode GPU. 3) Decode GPU focuses on fast, low-latency token generation.
  • Why it matters: Without separating, prefill and decode fight for the same resources, slowing both.

šŸž Anchor: One kitchen station chops veggies (prefill); another simmers soup (decode). Passing cleanly prepped veggies keeps the soup station fast.

šŸž Hook: Picture two roads: a slow but necessary warehouse road (storage network) and a fast highway just for kitchens (compute network). If everyone uses only the warehouse road, traffic jams.

🄬 The Concept (Storage vs. Compute Networks and the Bottleneck):

  • What it is: Servers have a storage NIC (SNIC) to read/write disks and a compute NIC (CNIC) to talk GPU-to-GPU on a fast network; they’re separated so traffic doesn’t collide.
  • How it works: Today, prefill GPUs load huge KV-Cache from storage via their single SNIC; decode GPUs’ SNICs sit idle. The CNIC highway is fast but mostly used for brief bursts of model communication.
  • Why it matters: Without using all available SNICs and the fast highway wisely, the prefill SNIC becomes the chokepoint and GPUs idle.

šŸž Anchor: In many systems, prefill’s storage road is packed, while decode’s storage road has open lanes—and the fast highway is mostly free between tiny rushes.

šŸž Hook: If your front door is jammed, it helps to have a side door plus a hallway inside to pass boxes around.

🄬 The Concept (The Problem Before This Paper):

  • What it is: Prefill engines alone fetch KV-Cache from storage, saturating their SNICs and starving the rest of the system.
  • How it works: 1) Long contexts + tiny new text each turn = very high KV-Cache reuse. 2) Lots of reading from storage, not much new compute. 3) Prefill SNIC hits its bandwidth limit; decode SNICs wait.
  • Why it matters: Without fixing this imbalance, adding more GPUs won’t help—throughput stays stuck at the SNIC limit.

šŸž Anchor: Even with powerful GPUs, if one narrow door feeds the whole class their books, the lesson can’t start faster.

Failed attempts and gap:

  • People tried big DRAM KV pools to avoid disk (fast but pricey, and memory is needed for other things in RL). Others shrank KV or sped up single-path I/O. None used decode-side SNICs to help prefill.
  • Missing piece: A way to load KV-Cache through multiple paths and coordinate traffic so nothing interferes with model latency.

Real stakes:

  • Faster coding agents, smoother help-desk bots, cheaper RL rollouts, better throughput with existing hardware, and lower energy use. When the system frees itself from the storage chokepoint, everyone gets answers sooner and at lower cost.

02Core Idea

šŸž Hook: Imagine a school with many doors. If everyone uses only one door, you get a traffic jam. If you open a second door and guide students smartly, the crowd flows.

🄬 The Concept (Aha! DualPath in one sentence): DualPath loads KV-Cache through two coordinated paths—either storage→prefill or storage→decode→prefill over the fast compute network—so all storage NICs and the compute highway help together without slowing model communication.

How it works (recipe):

  1. Watch both kinds of NICs and GPU loads.
  2. For each request, pick whether prefill or decode should read from storage.
  3. If decode reads, it forwards KV-Cache to prefill over the compute highway (RDMA).
  4. Use network QoS so KV moves never cut in line ahead of model comms.
  5. Stream layer-by-layer to fit memory and overlap moves with compute.

Why it matters: Without a second path and smart traffic rules, the single prefill storage NIC stays jammed, and GPUs waste time waiting.

šŸž Anchor: It’s like letting parents drop kids at either the front or side entrance, then using a wide hallway to distribute them to classrooms.

Three analogies:

  • Roads: Two roads from the warehouse—one straight to the prep kitchen, one to the serving kitchen then across a fast hallway.
  • Buckets and hoses: One faucet isn’t enough; open a second faucet and pass water through a big pipe inside.
  • Library carts: If one librarian’s cart is full, another librarian fetches books and hands them off via a staff-only corridor.

Before vs After:

  • Before: Prefill’s single SNIC ā€œdoorā€ limits the whole system; decode’s SNIC idle; compute network underused.
  • After: Both prefill and decode SNICs pull from storage, and the compute network ferries KV-Cache internally with QoS, so no one door jams the day.

Why it works (intuition):

  • Pool the scarce thing (storage bandwidth) across all nodes by adding a second path.
  • Keep the fast thing (compute communications) pristine using QoS lanes—KV traffic uses leftover room only.
  • Balance both compute and NIC usage with a global scheduler; stream per-layer so memory never explodes.

Building blocks (each in Sandwich style):

šŸž Hook: You know how you can choose the shorter grocery line? 🄬 Dual-Path KV-Cache Loading:

  • What: Two ways to fetch KV-Cache—either directly into prefill or into decode then pass along.
  • How: Measure queue lengths and loads; choose the path that’s freer; move KV via RDMA on the compute network.
  • Why: One path alone becomes a choke point. šŸž Anchor: If the prefill line is long, the decode line fetches first and hands you the bags through a staff hallway.

šŸž Hook: Highways have fast lanes for ambulances and slower lanes for trucks. 🄬 Traffic Isolation with QoS:

  • What: Put model-critical messages in high-priority lanes; send KV-Cache on low-priority lanes.
  • How: Configure InfiniBand virtual lanes (or RoCE TCs/DSCP) so high-priority gets most bandwidth.
  • Why: Without lanes, KV-Cache can delay token generation. šŸž Anchor: The ambulance (model comms) always gets through; the moving truck (KV-Cache) waits a beat if needed.
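The effect of the QoS lanes can be modeled as a priority-first bandwidth split. This toy function is an assumption-laden sketch (the real behavior comes from InfiniBand virtual-lane arbitration or RoCE traffic classes, not application code); the 99% reservation mirrors the high-priority share described later in the methodology:

```python
def share_bandwidth(link_gbps: float, hp_demand: float, lp_demand: float,
                    hp_reserve: float = 0.99) -> tuple[float, float]:
    """Toy priority-first split of one link's bandwidth.

    High-priority traffic (model comms) may claim up to hp_reserve of
    the link; low-priority traffic (KV-Cache moves) only gets whatever
    is left over, so it can never crowd out the "ambulance" lane.
    """
    hp = min(hp_demand, link_gbps * hp_reserve)
    lp = min(lp_demand, link_gbps - hp)
    return hp, lp
```

With a 400 Gbps link, 100 Gbps of model comms, and 400 Gbps of offered KV traffic, comms get their full 100 and KV fills the remaining 300, never the other way around.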

šŸž Hook: If everything must pass through the same doorway, the doorkeeper can decide who goes first. 🄬 CNIC-Centric Copy:

  • What: Route all GPU-bound KV moves through the compute NIC (CNIC) using RDMA writes/reads, not ad-hoc PCIe copies.
  • How: Read from storage to DRAM, then CNIC→GPU (and back) so QoS applies end-to-end.
  • Why: Without this, side-copy paths can’t be prioritized and may interfere with model comms. šŸž Anchor: One guarded door where priority people pass first, and boxes wait briefly.

šŸž Hook: Packing your backpack by subject keeps it light and organized. 🄬 Layerwise Prefill:

  • What: Stream KV a layer at a time.
  • How: Hold only one layer’s KV in HBM; free it, then load the next; overlap with compute.
  • Why: Without it, big batches don’t fit, and GPUs idle. šŸž Anchor: Carry only the math book when it’s math period; swap for history later.
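Layerwise prefill's memory bound can be seen in a tiny, timing-free simulation. This is a sketch of the double-buffering idea (compute layer L while prefetching layer L+1), not the paper's implementation:

```python
def layerwise_prefill(num_layers: int) -> tuple[list[str], int]:
    """Simulate double-buffered layerwise prefill.

    While layer L's KV is being used for compute, layer L+1's KV is
    prefetched, so at most two layers' KV sit in HBM at once, versus
    all num_layers if everything were loaded up front. Returns the
    event log and the peak number of resident layers.
    """
    events, resident, peak = [], set(), 0
    resident.add(0)                      # load the first layer's KV
    events.append("load L0")
    for layer in range(num_layers):
        nxt = layer + 1
        if nxt < num_layers:
            resident.add(nxt)            # prefetch next layer's KV
            events.append(f"load L{nxt} (overlapped)")
        peak = max(peak, len(resident))
        events.append(f"compute L{layer}")
        resident.discard(layer)          # free this layer's KV
    return events, peak
```

Peak residency stays at two layers no matter how deep the model is, which is why a 30k-token prompt fits where a load-everything approach would not.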

šŸž Hook: A dispatcher sends school buses where they’re needed. 🄬 Global Scheduler:

  • What: A controller decides which engines get which requests and who reads from storage.
  • How: Track token counts, queue lengths, and memory; balance across PEs and DEs to avoid hotspots.
  • Why: Without balancing, one bus route gets jammed and others sit empty. šŸž Anchor: If the north gate is crowded, send the next bus to the south gate instead.
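The dispatcher's balancing rule reduces to picking the least-loaded engine from simple signals. A minimal sketch, assuming unfinished-token count is the primary signal and storage read-queue length the tiebreaker (field names are made up for illustration):

```python
def pick_engine(engines: list[dict]) -> str:
    """Toy inter-engine balancing: send the next request to the engine
    with the fewest unfinished tokens, breaking ties by the shorter
    storage read queue on its node."""
    best = min(engines, key=lambda e: (e["unfinished_tokens"], e["read_queue"]))
    return best["name"]

# The crowded north gate loses; the next bus goes to the south gate.
engines = [
    {"name": "PE0", "unfinished_tokens": 50_000, "read_queue": 8},
    {"name": "PE1", "unfinished_tokens": 20_000, "read_queue": 3},
]
```
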

03Methodology

At a high level: Input requests → Scheduler decides path and placement → KV-Cache loads to PE or DE buffer → CNIC-assisted H2D to GPU and layerwise prefill runs (overlapped) → KV-Cache handed to DE → Decode and persist new KV.

Key steps with Sandwich explanations and concrete examples:

  1. Decide who does what (Inter-engine scheduling) šŸž Hook: If two supermarket checkouts are open, you join the shorter line. 🄬 What: The scheduler assigns each request to a (prefill engine, decode engine) pair and picks where to read KV-Cache.
  • How: 1) Group engines to reduce overhead; only leaders talk to the scheduler. 2) Collect each engine’s unfinished tokens and node read-queue length. 3) Prefer nodes with shorter storage queues and engines with fewer tokens. 4) For decode groups, balance by tokens and check HBM room. 5) Choose PE-read or DE-read based on which side has the shorter disk queue.
  • Why: Without this, one side clogs and the other side idles. šŸž Anchor: If PE’s storage queue is long but DE’s is short, let DE fetch this time.
  2. Stream layer by layer (Layerwise prefill) šŸž Hook: You don’t carry every book to every class—just the one you need now. 🄬 What: Move and compute one layer’s KV-Cache at a time.
  • How: 1) For PE-read: storage→PE buffer→PE HBM for layer L; compute misses; then pass full layer KV to DE buffer; repeat for L+1. 2) For DE-read: storage→DE buffer; when PE needs layer L, DE→PE over compute network; compute misses; merge and continue.
  • Why: Without streaming, you’d hit HBM limits and shrink batch sizes. šŸž Anchor: A 30k-token prompt fits because we bring one layer at a time, not all layers at once.
  3. Move data without slowing tokens (CNIC-centric traffic manager) šŸž Hook: Give ambulances the fast lane and trucks the slow lane. 🄬 What: Use compute NIC (CNIC) with QoS lanes for all GPU-bound copies.
  • How: 1) Read KV from storage to DRAM. 2) CNIC RDMA-write to GPU (H2D) and RDMA-read for D2H. 3) Assign model comms to high-priority lane (~99% reserved) and KV transfers to low-priority. 4) Use doorbell batching to lower per-operation overhead on many small chunks.
  • Why: Without lane separation and CNIC control, KV traffic can delay expert-parallel AllToAll or tensor-parallel collectives. šŸž Anchor: Even during big KV moves, first tokens still arrive on time because comms keep the priority lane.
  4. Use the fast hallway (RDMA over compute network) šŸž Hook: A staff-only hallway moves things quickly between rooms. 🄬 What: When DE reads from storage, it sends KV to PE over the compute network using RDMA.
  • How: 1) DE buffer holds blocks. 2) RDMA transfers layer blocks to PE as prefill advances. 3) Overlap transfers with compute so GPUs don’t wait.
  • Why: Without the hallway, only the warehouse road feeds prefill, keeping the bottleneck. šŸž Anchor: DE fetches a layer’s KV, PE computes while the next layer’s KV is already en route.
  5. Pack blocks smartly (Block layout) šŸž Hook: File folders (by subject) make papers easier to move and store. 🄬 What: Two block types: Layer Block (one layer for many tokens) and Full Block (all layers for many tokens).
  • How: 1) Storage uses Full Blocks for efficiency. 2) Transfers during compute use Layer Blocks for streaming. 3) Concatenate Layer Blocks to form Full Blocks when persisting.
  • Why: Without this, conversions or too many tiny transfers would waste time. šŸž Anchor: We store whole folders in the cabinet, but hand sheets layer-by-layer to the teacher during class.
  6. Keep batches balanced (Intra-engine scheduling) šŸž Hook: Group projects go faster when each teammate has a similar workload. 🄬 What: Choose how many requests to put in the next prefill batch so attention time across GPUs stays similar.
  • How: 1) Estimate attention time from cached/miss tokens and a hardware profile. 2) Use a compute quota (e.g., 300ms). 3) If a request would exceed the quota, prefill a chunk of it and defer the rest.
  • Why: Without balance, some GPUs wait (bubbles) for others to finish. šŸž Anchor: Everyone does about 5 minutes of work per round; if a task is bigger, we split it.
  7. Decode and persist šŸž Hook: After prep, serving is quick, and you tidy up as you go. 🄬 What: Decode GPUs generate tokens; as full blocks accumulate, persist new KV back to storage.
  • How: 1) H2D allocations, then low-latency decode. 2) Once a block (e.g., 64 tokens) is ready, write it out. 3) Free CPU buffers early to save memory.
  • Why: Without timely persistence, you’d lose reuse next turn. šŸž Anchor: Every time we fill a page of notes, we file it so it’s ready for the next class.
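The quota-based batching in the intra-engine scheduling step can be sketched as a greedy packer. For simplicity this version measures the quota directly in uncached ("miss") tokens rather than converting to milliseconds through a hardware profile; the chunk-and-defer behavior is the point:

```python
def build_batch(requests: list[int], quota_tokens: int) -> tuple[list[int], list[int]]:
    """Greedy prefill batch builder under a per-round compute quota.

    Each request is described by its count of uncached ("miss") tokens,
    used here as a stand-in for estimated attention time. A request that
    would exceed the quota is split: a chunk prefills now, the remainder
    is deferred to the next round.
    """
    batch, deferred, used = [], [], 0
    for miss in requests:
        if used + miss <= quota_tokens:
            batch.append(miss)
            used += miss
        else:
            room = quota_tokens - used
            if room > 0:
                batch.append(room)           # prefill a chunk now
                deferred.append(miss - room) # finish it next round
                used = quota_tokens
            else:
                deferred.append(miss)        # no room at all this round
    return batch, deferred
```

Every round does roughly the same amount of attention work, so no GPU sits in a bubble waiting for an oversized straggler.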

Secret sauce:

  • Dual paths pool storage bandwidth across all nodes.
  • CNIC + QoS prevent KV transfers from hurting latency-critical comms.
  • Layerwise streaming and batch quota keep GPUs fed but not overstuffed.
  • The scheduler makes fast, local load-balancing choices using simple signals (token counts, queue lengths, HBM).

04Experiments & Results

šŸž Hook: To see if a new traffic plan works, you measure how many buses get kids to class on time without blocking the emergency lanes.

🄬 The Test:

  • What they measured: Throughput and latency for agent-style workloads. Offline: Job Completion Time (JCT) for many agents started together. Online: Time-to-First-Token (TTFT), Time-to-Second-Token (TTST), and per-token time (TPOT), with strict SLOs (e.g., TTFT ≤ 4 s, TPOT ≤ 50 ms).
  • Why: These show whether the system moves lots of KV-Cache without delaying actual generation.

šŸž Anchor: If buses arrive faster and the ambulance still speeds through, the plan worked.

The competition:

  • Basic: their framework without DualPath.
  • SGL(MC): a popular open-source stack with DRAM KV caching + PD.
  • Oracle: an idealized upper bound with zero I/O overhead.

Scoreboard with context:

  • Offline batch inference (many agents, long contexts): DualPath cut JCT by up to 1.87Ɨ vs. Basic. That’s like turning a long group project from 2 hours into about 64 minutes.
  • Online serving (steady arrival of agents with SLOs): Average serving capacity rose by ~1.96Ɨ without violating latency SLOs. Think of hosting almost twice as many students without longer lines.
  • Trends: Gains grew with longer contexts and larger batches—exactly where storage I/O dominates. When appended or generated tokens per turn were larger (more compute), DualPath’s edge shrank, as expected.
  • P/D ratios: DualPath benefits held across 1P1D, 1P2D, 2P1D, etc., because it uses all nodes’ storage bandwidth, not just prefill’s.

Surprising/Notable findings:

  • CNIC-centric RDMA copies were faster than CUDA memcpy for many small chunks (ā‰ˆ1µs vs. 5–7µs per op), and QoS lanes kept TTST/TPOT stable.
  • Scheduling materially improved balance: storage NIC traffic max/avg ratio improved from ~1.53 to ~1.18; attention time across GPUs stayed tighter, reducing idle bubbles.
  • Large-scale runs (up to 1,152 GPUs) showed near-linear scaling in both offline and online settings with low scheduler CPU use, indicating the approach generalizes.

šŸž Anchor: DualPath turned idle doors and a quiet hallway into active pathways, so more kids got seated faster without blocking the emergency lane.

05Discussion & Limitations

šŸž Hook: Even a great traffic plan has places with road work, busy holidays, or maps that could be better.

🄬 Limitations:

  • What: DualPath needs integration (about 5k LOC changes) into inference stacks and careful NIC/QoS setup.
  • How it shows up: Complex deployments, especially mixed fabrics or limited QoS control, may blunt gains.
  • Why it matters: Without correct configuration, KV moves might still nudge latency.

Required resources:

  • Separate compute and storage networks; RDMA-capable NICs; QoS features (InfiniBand VLs or RoCE DSCP/TC); a distributed storage backend; profiling to set compute quotas and thresholds.

When not to use:

  • Workloads dominated by compute (long appends/long generations), tiny contexts (low KV reuse), or single-turn prompts; environments lacking RDMA/QoS; extremely memory-rich clusters that already fit most KV in DRAM.

Open questions:

  • Auto-tuning P/D ratios and parallelism online as workloads shift.
  • Splitting single requests to read simultaneously from both PE and DE to further reduce wait time.
  • Extending QoS guarantees across heterogeneous fabrics and next-gen interconnects.
  • Improving small-model decode transfer overheads (observed TPOT gap to Oracle) and further reducing tail TTFT.

šŸž Anchor: Like a city adding smart lights and new bike lanes later, there’s room to make DualPath even smoother and safer over time.

06Conclusion & Future Work

Three-sentence summary: DualPath fixes a hidden choke point in agent-style LLM serving by loading KV-Cache through two coordinated paths and enforcing traffic rules that protect model latency. It pools storage bandwidth across all nodes and uses the compute network’s QoS lanes plus smart scheduling to keep GPUs busy and tokens timely. In real workloads, that delivered up to 1.87Ɨ faster offline throughput and about 1.96Ɨ higher online capacity without breaking SLOs.

Main achievement: Turning an asymmetric, single-door KV-Cache pipeline into a balanced, dual-path system that exploits idle decode-side bandwidth and the fast compute fabric—without interfering with critical communications.

Future directions: Auto-adjust P/D ratios and parallelism; dual-sided partial reads per request; stronger QoS across diverse fabrics; reduce small-model PD transfer overheads; integrate with DRAM caches when economical.

Why remember this: As LLMs act more like agents, storage—not compute—often becomes the bottleneck; DualPath shows that re-routing data and policing lanes can matter more than buying more FLOPS. It’s a blueprint for getting twice the mileage out of today’s clusters while keeping answers snappy.

Practical Applications

  • Serve more concurrent coding agents in CI pipelines by pooling storage bandwidth across decode and prefill nodes.
  • Speed up RL rollouts for agentic training by streaming KV per layer and avoiding storage bottlenecks.
  • Improve customer support bots that browse and summarize multiple pages over long sessions without latency spikes.
  • Run large long-context analyses (e.g., legal or scientific review) with steady token latency despite massive KV reuse.
  • Consolidate clusters by nearly doubling serving capacity, lowering cost per token for enterprise deployments.
  • Stabilize latency SLOs during traffic bursts by isolating KV transfers with network QoS lanes.
  • Reuse existing hardware more efficiently when storage NIC upgrades are impractical or too costly.
  • Enable predictable multi-tenant serving where KV-heavy tenants no longer starve others’ token latency.
  • Accelerate batch processing of long agent trajectories for analytics and evaluation pipelines.
  • Combine with selective DRAM caching for hot prefixes to squeeze even more throughput where memory is available.
#KV-Cache#prefill-decode disaggregation#dual-path loading#RDMA#compute vs storage network#QoS virtual lanes#layerwise prefill#agentic LLM inference#throughput#TTFT#TPOT#scheduling#HBM/DRAM#InfiniBand#RoCE