**Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**
Key Summary
- SPEED-Bench is a new, unified test for Speculative Decoding that checks both how smart the guesses are and how fast the whole system runs.
- It has two special data splits: one to measure guess quality across many topics (Qualitative) and one to measure real serving speed at big scales (Throughput).
- The benchmark carefully picks diverse prompts using embeddings so models are tested on many different kinds of tasks, not just easy or similar ones.
- It also standardizes how inputs are tokenized and measured so results are fair across engines like TensorRT-LLM, vLLM, and SGLang.
- Results show that acceptance length and speedups depend a lot on the domain: coding and math are easier to speculate, while roleplay and writing are harder.
- Post-training drafters like EAGLE3 help, but models with native multi-token prediction heads can achieve even longer accepted runs.
- Using random tokens to test speed is misleading and can overestimate throughput by about 23% when Speculative Decoding is on.
- Vocabulary pruning can silently hurt acceptance on the long tail (like multilingual or RAG), which SPEED-Bench exposes thanks to its diversity.
- The Throughput split uses long input sequence lengths (1k–32k) and high batch sizes to reflect real applications like coding and retrieval.
- SPEED-Bench gives researchers and practitioners a practical, apples-to-apples way to evaluate accuracy and speed under realistic workloads.
Why This Research Matters
SPEED-Bench helps teams build faster, cheaper, and fairer AI systems by testing what actually happens in the real world. It shows where speculative decoding truly helps (like coding) and where it struggles (like creative roleplay), so developers can tune strategies per domain. By using long inputs and big batches, it matches modern apps like document chat and retrieval-augmented coding assistants. The unified framework makes cross-engine comparisons fair, preventing misleading wins caused by tiny formatting differences. It also exposes hidden trade-offs, like how vocabulary pruning hurts multilingual users or how random tokens inflate speed numbers. These insights guide smarter deployment choices, better user experience, and more trustworthy benchmarks.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you and a friend are writing a story together. Your friend tries to guess your next three words to speed things up. If they guess right, you both go faster. If they guess wrong, you slow down to fix it.
🥬 Filling (The Actual Concept): Speculative Decoding (SD) is a way large language models speed up writing by letting a lightweight "drafter" guess a few future tokens that the main "target" model then checks.
- How it works (recipe):
- The drafter suggests several next tokens in one go.
- The target model verifies those tokens in parallel.
- If the guesses match what the target would write, they are accepted; if not, the target corrects and continues.
- Why it matters: Without SD, the target model writes one token at a time, which is slower and costs more.
🍞 Bottom Bread (Anchor): It's like a teacher's helper writing answer steps for a math problem in pencil. The teacher quickly checks them all at once. If most steps are right, they're kept, and the class moves faster.
The world before: People were inventing many SD tricks, but measuring them fairly was messy. Benchmarks often used tiny prompt sets, short inputs, batch size one, or random tokens that don't look like real user text. Serving engines handled prompts differently, so results didn't match across systems. Teams couldn't easily compare results because setups varied, and many tests didn't stress the models in real ways (like long contexts or high concurrency).
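Under greedy decoding, this draft-then-verify loop can be sketched with toy stand-ins for both models (`draft_model` and `target_model` here are plain callables invented for illustration, not the paper's setup):

```python
def speculate_step(prefix, draft_model, target_model, draft_len=3):
    """One speculative decoding step: draft, verify, accept a prefix.

    draft_model(prefix) -> next token (cheap guess)
    target_model(prefix) -> next token (what greedy decoding would emit)
    """
    # 1. Drafter proposes `draft_len` tokens autoregressively.
    drafted = []
    ctx = list(prefix)
    for _ in range(draft_len):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Target verifies the drafted positions (in a real engine this is
    #    a single parallel forward pass, which is where the speedup comes from).
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_model(ctx)
        if tok != expected:
            # 3. First mismatch: keep the target's correction and stop.
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        # All drafted tokens matched; the target emits one bonus token.
        accepted.append(target_model(ctx))
    return accepted
```

For greedy decoding this loop is lossless: the accepted output matches what the target alone would have produced, token for token; only the number of target steps changes.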
🍞 Top Bread (Hook): You know how you can guess your friend's favorite ice cream flavor better if you know the kind of ice cream shop you're in? If it's a gelato shop, your guesses change. Context matters.
🥬 Filling (The Actual Concept): Acceptance Rate (AR) tells us how often drafted tokens are kept, and Acceptance Length (AL) tells us how many in a row are accepted before a correction.
- How it works (recipe):
- The drafter proposes a short run of tokens.
- The target accepts some or all of them.
- We count what fraction were accepted (AR) and the typical run length (AL).
- Why it matters: Without AR/AL, we can't tell if SD is helping or if it just adds work that slows things down.
🍞 Bottom Bread (Anchor): If a drafter suggests three tokens and the target keeps two, that's a good sign: the helper is saving time.
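A tiny script makes the counting concrete (the log format here is invented for illustration; definitions vary slightly across papers, and this sketch counts only drafted tokens, not any bonus token):

```python
def acceptance_stats(attempts):
    """Compute Acceptance Rate (AR) and mean Acceptance Length (AL).

    `attempts` is a list of (num_drafted, num_accepted) pairs, one per
    speculative step; num_accepted counts the accepted streak before
    the first correction.
    """
    drafted = sum(d for d, _ in attempts)
    accepted = sum(a for _, a in attempts)
    ar = accepted / drafted        # fraction of all drafted tokens kept
    al = accepted / len(attempts)  # mean accepted streak per step
    return ar, al

# Drafter proposes 3 tokens each step; streaks of 2, 3, and 1 are kept.
ar, al = acceptance_stats([(3, 2), (3, 3), (3, 1)])
# ar = 6/9, al = 6/3 = 2.0
```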
Problem: Real speedups depend on what kind of text people feed the model (low-entropy code vs. high-entropy roleplay), how long the input is, and how many users are served at once. But most tests didn't reflect that. They ignored memory-bound vs. compute-bound regimes that appear at large batch sizes or long contexts.
🍞 Top Bread (Hook): Think of making pancakes. One short recipe card (short input) is easy. But a giant cookbook (long input) changes what slows you down: maybe flipping speed or how many pans you have becomes the bottleneck.
🥬 Filling (The Actual Concept): Input Sequence Length (ISL) is how long the prompt is, and Batch Size is how many prompts we cook at once.
- How it works (recipe):
- Short ISL means less prefill work; long ISL means heavy prefill and memory traffic.
- Small batch can be compute-bound; large batch often becomes memory-bound.
- Why it matters: Without testing across ISL and batch size, we can't predict real serving performance.
🍞 Bottom Bread (Anchor): Serving 2 users with short prompts feels different from serving 256 users with 8k-token prompts; bottlenecks move around.
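A back-of-envelope traffic model shows why bottlenecks move: weight reads are shared by the whole batch, but KV-cache reads grow with batch size × ISL. All constants below are illustrative (roughly a 70B fp16 model); none come from the paper.

```python
def kv_vs_weight_traffic(batch, isl, kv_bytes_per_tok=160_000, weight_bytes=140e9):
    """Fraction of per-decode-step memory traffic spent on KV cache.

    Weights are read once per step and shared across the batch; each
    request additionally reads its own KV cache of `isl` tokens.
    """
    kv = batch * isl * kv_bytes_per_tok
    return kv / (kv + weight_bytes)

# Few users, short prompts: weight reads dominate.
small = kv_vs_weight_traffic(batch=4, isl=1_000)
# Many users, long prompts: KV-cache reads dominate, i.e. the regime
# drifts memory-bound and changes how much SD can help.
large = kv_vs_weight_traffic(batch=512, isl=8_000)
```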
Failed attempts: Teams used random tokens to simulate load, which tricked models into trivial replies or odd expert routing. Some benchmarks had too few, too-similar prompts, so they hid failures on rare but important cases (like multilingual or RAG). Others compared engines without aligning tokenization and formatting, making results unfair.
The gap: We needed one benchmark that (1) covers many domains with high semantic diversity to test draft quality, (2) exercises real serving regimes with long inputs and big batches to test throughput, and (3) standardizes tokenization and timing so engines are compared fairly.
Real stakes: This affects how fast assistants answer, how many users a system can handle, and how much it costs. It changes whether coding tools feel snappy, whether long-document chat actually scales, and whether optimizations secretly hurt accuracy for certain users (like multilingual ones) without anyone noticing.
02 Core Idea
The "Aha!": If you don't test both brains (semantic diversity) and brawn (system throughput under real loads) with a shared yardstick, you will draw the wrong conclusions about Speculative Decoding.
🍞 Top Bread (Hook): You know how a race car needs both a sharp driver and a strong engine? Testing only on a straight track or only on curvy roads tells you half the story.
🥬 Filling (The Actual Concept): SPEED-Bench unifies evaluation of SD with two complementary data splits and a shared, engine-agnostic measurement framework.
- How it works (recipe):
- Qualitative Split: Curate 11 diverse categories using embeddings and select prompts that are as different as possible inside each category.
- Throughput Split: Build fixed ISL buckets (1k–32k) across three difficulty levels and high concurrency to map speed vs. latency trade-offs.
- Unified Framework: Pre-tokenize and standardize prompt formatting so engines see identical inputs; record fine-grained timing and acceptance stats.
- Why it matters: Without this trio, you can easily miss domain-dependent failures, overestimate speed with synthetic inputs, or compare engines unfairly.
🍞 Bottom Bread (Anchor): It's like testing a bicycle by trying hills, flats, and turns, and using the same speedometer every time. You learn how it behaves everywhere, and you can compare it fairly to other bikes.
Multiple analogies:
- Chef analogy: Qualitative Split taste-tests many cuisines (coding, math, roleplay), Throughput Split stress-tests the kitchen at dinner rush, and the Framework makes sure everyone uses the same recipe cards and timers.
- School analogy: Qualitative Split is different subject exams; Throughput Split is timed exams with lots of students; Framework is the same grading rubric and timer so results are fair.
- Sports analogy: Qualitative Split is training drills for skills; Throughput Split is full matches with crowd pressure; Framework is the shared scoreboard and referees.
Before vs. After:
- Before: Benchmarks were small, narrow, and inconsistent; random tokens and short prompts gave rosy speed numbers.
- After: SPEED-Bench shows acceptance and speedups vary widely by domain and serving regime; some optimizations (like vocabulary pruning) help average speed but hurt the long tail; random inputs can inflate throughput by about 23% under SD.
Why it works (intuition):
- Diverse prompts surface domain entropy differences: predictable text (code/math) allows longer accepted runs; creative text (roleplay/writing) is harder to pre-guess.
- Fixed ISL buckets and big batches expose compute-bound vs. memory-bound transitions that change SD's payoff.
- Pre-tokenized, standardized inputs isolate the algorithmic effect from formatting or tokenization quirks across engines.
Building blocks (with sandwich explanations for key terms):
- 🍞 Top Bread (Hook): You know how sometimes you keep guessing correctly a few times in a row when playing word games? 🥬 Acceptance Length (AL): AL is how many drafted tokens in a row the target accepts before a correction.
- How it works: The drafter proposes a short run; the target keeps a streak until a mismatch.
- Why it matters: Longer AL means fewer target steps and bigger speedups. 🍞 Bottom Bread (Anchor): If the drafter offers three tokens and the target keeps all three before correcting, AL is 3 for that attempt.
- 🍞 Top Bread (Hook): Imagine checking homework step by step; if the first step is correct, you're more likely to trust the next. 🥬 Conditional Acceptance Rate (CAR): CAR measures the chance the k-th drafted token is accepted given the earlier ones were accepted.
- How it works: We condition each step's acceptance on previous successes.
- Why it matters: It reveals where streaks tend to break. 🍞 Bottom Bread (Anchor): If token 1 is always accepted but token 2 is accepted 68% of the time after token 1, CAR shows where drafts weaken.
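Computed from a log of accepted-streak lengths, per-position CARs might look like this (the streak-log format is assumed for illustration):

```python
def conditional_acceptance(streaks, draft_len):
    """Per-position conditional acceptance rates from streak lengths.

    `streaks` lists, for each speculative step, how many drafted tokens
    in a row were accepted (0..draft_len).
    car[k] = P(token k+1 accepted | tokens 1..k accepted).
    """
    car = []
    for k in range(draft_len):
        reached = sum(1 for s in streaks if s >= k)      # position k+1 was tried
        passed = sum(1 for s in streaks if s >= k + 1)   # ...and accepted
        car.append(passed / reached if reached else 0.0)
    return car

# Four steps with streaks 3, 2, 1, 0 out of 3 drafted tokens each.
car = conditional_acceptance([3, 2, 1, 0], draft_len=3)
# car[0] = 3/4: three of the four steps kept their first drafted token.
```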
- 🍞 Top Bread (Hook): Picture mixing many different candies so your tasting covers everything, not just gummy bears. 🥬 Qualitative Split: A compact, highly diverse prompt set across 11 categories selected to minimize similarity.
- How it works: Embed prompts, pick a subset that spreads out in meaning-space.
- Why it matters: Without diversity, you miss domain-dependent failures. 🍞 Bottom Bread (Anchor): Coding prompts that vary from regex to C++ templates plus roleplay prompts from many personas give a truer picture than 10 near-duplicates.
- 🍞 Top Bread (Hook): For a stress test, you don't just run; you run long distances with a heavy backpack. 🥬 Throughput Split: Long ISL buckets (1k–32k), three difficulty levels, and high concurrency to test real serving loads.
- How it works: Group prompts by input length and difficulty; sweep batch sizes to map throughput vs. per-user TPS.
- Why it matters: Without long contexts and big batches, you miss memory-bound regimes and misjudge SD payoffs. 🍞 Bottom Bread (Anchor): Serving 512 requests of 8k tokens hits very different limits than serving 8 requests of 1k tokens.
- 🍞 Top Bread (Hook): To compare runners, you must use the same starting line and stopwatch. 🥬 Unified Measurement Framework: A light wrapper that pre-tokenizes inputs and unifies timing across engines.
- How it works: Engines receive identical token sequences; the framework logs acceptance, step latency, user TPS, and total throughput.
- Why it matters: Without it, tiny formatting differences skew results. 🍞 Bottom Bread (Anchor): Two engines now score the same essay with the same rubric, so you can trust who's actually faster or more accurate.
Quick formulas with examples:
- Acceptance Rate: AR = accepted tokens / drafted tokens. Example: If 2 of 3 drafted tokens are accepted, AR = 2/3 ≈ 0.67.
- Throughput (Output TPS): TPS = total output tokens / elapsed seconds. Example: 10,000 tokens in 5 seconds gives 2,000 TPS.
- Average pairwise similarity (conceptual): the mean cosine similarity over all prompt pairs; lower means more diverse. Example: If two pairs have cosine 0.2 and 0.4, the average is 0.3.
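The diversity measure is the only one of the three that needs more than a single division; a small, dependency-free sketch:

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_pairwise_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs; lower = more diverse."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

For example, the three toy vectors [1, 0], [0, 1], and [1, 1] have pair similarities 0, 1/√2, and 1/√2, averaging to about 0.47.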
03 Methodology
At a high level: Input data → Build Qualitative Split (diverse prompts) and Throughput Split (ISL buckets) → Unified Measurement Framework (pre-tokenize, standardize) → Run engines with SD → Record acceptance, latency, and throughput → Analyze domain- and regime-dependent behavior.
Step A. Curate the Qualitative Split (semantic diversity for draft quality)
- What happens: Gather prompts from 18 public sources and group them into 11 categories (Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, QA). For each category, embed all candidate prompts using a pretrained text embedder and select 80 that are as different as possible by minimizing average pairwise cosine similarity. Total prompts: 880.
- Why this step exists: Draft quality depends on domain entropy. If prompts are too similar, you'll overfit to easy patterns and miss the hard corners (like multilingual or long-tail RAG).
- Example with actual data: In Coding, you might include Python refactoring, SQL joins, regex generation, C++ templates, and documentation generation, each far apart in embedding space so they don't cluster.
Sandwich recap:
- 🍞 Hook: Sorting your candy so you taste many flavors.
- 🥬 Concept: Qualitative Split = small but widely spread prompt set for accurate acceptance measurement.
- 🍞 Anchor: 80 diverse Roleplay prompts prevent overestimating SD on creative tasks.
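A common greedy heuristic for this kind of "spread out in meaning-space" selection looks like the following; the paper's exact optimizer may differ, so treat this as an approximation of the idea:

```python
def select_diverse(embeddings, k):
    """Greedily pick k indices that approximately minimize average
    pairwise cosine similarity (a farthest-point-style heuristic).

    Embeddings are assumed L2-normalized, so a dot product is a cosine.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    selected = [0]  # arbitrary seed; real pipelines may try several seeds
    while len(selected) < k:
        best, best_score = None, float("inf")
        for i in range(len(embeddings)):
            if i in selected:
                continue
            # Average similarity of candidate i to the current subset.
            score = sum(dot(embeddings[i], embeddings[j]) for j in selected)
            score /= len(selected)
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With a duplicate in the pool, e.g. [[1, 0], [1, 0], [0, 1]], asking for 2 prompts skips the near-duplicate and picks indices 0 and 2.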
Step B. Construct the Throughput Split (realistic serving workloads)
- What happens: Bucket prompts into fixed Input Sequence Lengths (ISL) from 1k to 32k tokens. For each ISL, collect 1,536 prompts: 512 each for low-, mixed-, and high-entropy difficulty. This supports building stable throughput vs. user TPS curves across batch sizes up to large values (for example, 512).
- Why this step exists: As batch size grows, systems can flip from compute-bound to memory-bound, changing SDās benefit. Long-context apps (coding assistants, RAG) are common; testing short ISL only is unrealistic.
- Example with actual data: At 8k ISL with batch sizes from 2 to 512, you can trace how output TPS rises, where user TPS plateaus, and how SD's draft length affects the Pareto frontier.
Sandwich recap:
- 🍞 Hook: Baking lots of cookies per hour means more trays (batch) and sometimes a limit from oven space (memory) rather than mixing speed (compute).
- 🥬 Concept: Throughput Split = long ISLs and high concurrency to test real-world speed.
- 🍞 Anchor: 32k-token prompts reveal whether SD helps or hurts when memory traffic dominates.
Step C. Unified Measurement Framework (fair comparisons across engines)
- What happens: The framework handles tokenization and formatting outside the inference engine. Engines receive pre-tokenized inputs so BOS tokens, chat templates, and spacing are consistent across TensorRT-LLM, vLLM, and SGLang. It records acceptance behavior, per-step latency, user tokens-per-second (per-request speed), and output TPS (system throughput).
- Why this step exists: Tiny tokenization differences can change the drafted sequence, making cross-engine comparisons unfair.
- Example with actual data: An identical Llama 3.3 70B Instruct prompt and EAGLE3 drafter fed to TensorRT-LLM and vLLM now yield directly comparable acceptance lengths and TPS.
Sandwich recap:
- 🍞 Hook: Using the same stopwatch and starting line for every runner.
- 🥬 Concept: Unified Framework = standardized inputs and timing.
- 🍞 Anchor: Two engines now produce truly comparable AL and TPS numbers on the same prompts.
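In spirit, the wrapper tokenizes once and hands every engine the same token IDs; a minimal sketch with hypothetical interfaces (`tokenizer(prompt)` and `engine(ids, n)` are stand-ins, not any engine's real API):

```python
import time

def run_standardized(engines, tokenizer, prompts, max_new_tokens=256):
    """Feed identical pre-tokenized inputs to every engine and time them."""
    # Tokenize once, outside the engines, so BOS handling, chat templates,
    # and spacing are identical for every backend.
    token_inputs = [tokenizer(p) for p in prompts]
    results = {}
    for name, engine in engines.items():
        start = time.perf_counter()
        outputs = [engine(ids, max_new_tokens) for ids in token_inputs]
        elapsed = time.perf_counter() - start
        out_tokens = sum(len(o) for o in outputs)
        results[name] = {
            "output_tps": out_tokens / elapsed,                   # system throughput
            "avg_user_tps": out_tokens / elapsed / len(prompts),  # crude per-request proxy
        }
    return results

# Toy stand-in engine: emits `n` dummy tokens per request.
fake = lambda ids, n: [0] * n
stats = run_standardized({"engineA": fake}, lambda p: [1, 2, 3], ["hi", "yo"], 8)
```

A real harness would additionally log acceptance counts and per-step latency, but the key design choice is visible here: no engine ever sees raw text.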
Step D. Metrics and reporting
- Acceptance metrics: AL (accepted streak length) and AR (fraction accepted). Conditional Acceptance Rates (CAR) show how streaks decay with each extra token. • Formula: AR = accepted tokens / drafted tokens. Example: 75 accepted out of 100 drafted means AR = 0.75.
- Throughput metrics: Output TPS (how many tokens per second across all requests) and User TPS (per-request token rate, a latency proxy). • Formula: TPS = output tokens / elapsed seconds. Example: 2,518 tokens in 1 second gives 2,518 TPS.
- Pareto curves: Sweep batch sizes to map trade-offs between high total throughput and good per-user speed.
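Given measurements from such a sweep, the non-dominated operating points can be extracted like this (the sample numbers are invented for illustration):

```python
def pareto_front(points):
    """Keep configurations not dominated on both axes (higher is better).

    `points` maps batch size -> (output_tps, user_tps).
    """
    front = {}
    for b, (out_tps, user_tps) in points.items():
        dominated = any(
            o >= out_tps and u >= user_tps and (o, u) != (out_tps, user_tps)
            for o, u in points.values()
        )
        if not dominated:
            front[b] = (out_tps, user_tps)
    return front

# Invented sweep: batch size -> (total output TPS, per-user TPS).
sweep = {8: (900, 110), 32: (2500, 78), 128: (5200, 40), 512: (5100, 12)}
# Batch 512 is dominated by 128: lower throughput AND lower per-user speed.
```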
Step E. Controls and realism
- ISL control: Truncate or pad prompts deterministically to hit target ISLs without changing meaning.
- No random-token shortcuts: Random noise can cause trivial replies or odd topic latching, inflating acceptance and throughput; it also misroutes experts in MoE models.
- System scope: Measure at realistic concurrency and memory pressure to reveal compute-bound vs. memory-bound behavior.
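Deterministic ISL control might be implemented like the sketch below; the head-plus-tail truncation and the pad token are illustrative choices, not necessarily the paper's:

```python
def fit_to_isl(token_ids, target_isl, pad_id=0):
    """Deterministically truncate or pad a tokenized prompt to `target_isl`.

    Truncation keeps the head and tail of the prompt (instructions often
    live at both ends); padding is appended so content stays untouched.
    """
    n = len(token_ids)
    if n == target_isl:
        return list(token_ids)
    if n > target_isl:
        head = target_isl // 2
        tail = target_isl - head
        return list(token_ids[:head]) + list(token_ids[-tail:])
    return list(token_ids) + [pad_id] * (target_isl - n)
```

Because the rule depends only on the token sequence and target length, every engine sees the same truncated or padded prompt on every run.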
Step F. Running the benchmark
- Choose a target model (e.g., Llama 3.3 70B Instruct) and a drafter (e.g., EAGLE3), set a draft length (e.g., 3), pick an engine (TensorRT-LLM, vLLM, or SGLang), and run at a specified concurrency (e.g., 32) over the Qualitative or Throughput split.
- The framework outputs acceptance histograms, per-category averages, Output TPS, per-GPU TPS, and request latency stats.
The secret sauce
- Diverse-but-compact selection maximizes signal per prompt, avoiding bloated, redundant testbeds.
- Long-context, high-batch stress reveals where SD helps or hurts under real constraints.
- Engine-agnostic, pre-tokenized inputs isolate algorithmic effects from preprocessing quirks.
- Together, these design choices make SPEED-Bench both practical for rapid iteration and faithful to production realities.
04 Experiments & Results
The tests: Measure draft quality and system speed in realistic conditions.
- What: Acceptance Length (AL), Acceptance Rate (AR), Conditional ARs, Output TPS (total tokens/sec), and User TPS (per-request speed).
- Why: AL/AR tell if SD guesses are trustworthy; TPS numbers show if speedups hold up under real load.
The competition: Prior unified SD benchmarks like SpecBench offered important steps but had small, short prompts and limited diversity. SPEED-Bench compares methods and engines under the same standardized setup.
Scoreboard highlights (batch size around 32, draft length 3; example models):
- Domain-dependent ALs: • Coding and Math (low entropy) show higher ALs. • Roleplay and Writing (high entropy) show lower ALs.
- Method differences: • N-Gram speculation (lightweight) can cause net slowdowns at moderate batch sizes (mean speedup below 1×), indicating overheads can outweigh gains. • Post-trained drafters like EAGLE3 improve ALs and speedups over simple baselines. • Native multi-token prediction (MTP) heads (e.g., Qwen3-Next) can push AL even higher, showing the benefit of co-training drafter and base model.
Concrete numbers from representative runs:
- Example Qualitative run (Llama 3.3 70B Instruct target, EAGLE3 drafter, TensorRT-LLM, concurrency 32): Overall average acceptance length around 2.45 across categories; Output TPS near 2,518 (about 315 per GPU).
- Cross-method comparison (illustrative from the paper): • Llama 3.3 70B with N-Gram: mean AL .41; mean speedup 0.88× (a net slowdown). • GPT-OSS 120B with EAGLE3: mean AL .25; mean speedup .34×. • Qwen3-Next with MTP: mean AL .81; mean speedup .20× at the tested setting.
Surprising findings:
- Random tokens overestimate throughput: When SD is enabled, using random-token prompts can inflate Output TPS by about 23% compared to realistic prompts at the same ISL, due to trivial responses and topic latching that distort acceptance behavior.
- Long-tail sensitivity to vocabulary pruning: Pruning the output vocabulary (e.g., in EAGLE3's projection layer) can leave Coding/Math largely unaffected but notably reduce acceptance in Multilingual, RAG, and Summarization, failures a low-diversity benchmark would likely miss.
- MoE expert activation differs: Random inputs fail to trigger realistic expert routing in mixture-of-experts models, leading to misleading speed measurements even without SD.
Per-metric intuition with quick formulas and examples:
- Acceptance Rate: AR = accepted tokens / drafted tokens. Example: 180 accepted out of 240 drafted gives AR = 0.75.
- Output TPS: TPS = output tokens / elapsed seconds. Example: 25,000 tokens in 10 seconds gives 2,500 TPS.
- Conditional AR (conceptually): CAR_k = P(token k accepted | tokens 1..k-1 accepted). Example: If token 1 is accepted 100% of the time and token 2 is accepted 68% of the time after token 1, then CAR_2 = 0.68, and both tokens survive on 68% of attempts.
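These per-position rates compose multiplicatively: the probability a streak of length k survives is the product of the first k CARs, and summing those survival probabilities gives the expected number of accepted drafted tokens per step:

```python
def expected_accepted(cars):
    """Expected accepted drafted tokens per step from per-position CARs.

    E[accepted] = sum over k of P(first k tokens all accepted)
                = sum of running products of the CARs.
    """
    total, survive = 0.0, 1.0
    for car in cars:
        survive *= car   # probability the streak reaches this position
        total += survive
    return total

# With CAR_1 = 1.0 and CAR_2 = 0.68, the expected accepted count from
# two drafted tokens is 1.0 + 0.68 = 1.68.
```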
Context for results: These outcomes underline that the same SD algorithm can shine or stumble depending on domain diversity, ISL, batch size, and system constraints. SPEED-Bench's combined splits and unified framework uncover the true range of behavior, not just best-case numbers.
05 Discussion & Limitations
Limitations:
- Coverage vs. compactness: Even with 880 diverse prompts, the Qualitative split is still a sample, not the entire world; some niche domains may remain underrepresented.
- Hardware dependence: Throughput results inevitably depend on GPU type, memory bandwidth, and interconnect; users should replicate on their own stacks.
- Model-specific quirks: Acceptance behavior varies with tokenizer, vocabulary, and training data; insights may shift across families.
- Dynamic serving policies: Real systems also do prioritization, scheduling, and caching, which may interact with SD in ways outside the benchmark's scope.
Required resources:
- Multi-GPU setups for some runs, production-grade engines (TensorRT-LLM, vLLM, SGLang), and storage for long-context datasets.
- Ability to run high-concurrency tests to trace Pareto curves across batch sizes.
When NOT to use:
- If you only care about tiny single-user demos with short prompts, SPEED-Bench's long-ISL, high-batch insights may be overkill.
- If your pipeline forces non-standard prompt munging that cannot be aligned with pre-tokenized inputs, cross-engine fairness may not hold.
- If you rely on synthetic random-token load tests, SPEED-Bench will conflict with that practice (by design) because it's misleading for SD and MoE.
Open questions:
- How do newer native MTP heads vs. post-trained drafters compare as context windows keep growing beyond 32k?
- Can we adaptively tune draft length by domain entropy signals on-the-fly to optimize AL and latency?
- What are the best safety checks so SD doesn't amplify hallucinations in high-entropy domains?
- How do routing strategies in MoE interact with SD under multilingual and code-mixed prompts when experts specialize further?
- Can vocabulary pruning be made adaptive so it keeps speed gains without harming long-tail domains?
06 Conclusion & Future Work
Three-sentence summary: SPEED-Bench is a unified, diverse benchmark that evaluates Speculative Decoding both for draft quality across many domains and for real serving throughput under long contexts and high concurrency. It standardizes tokenization and timing across engines, revealing domain-dependent acceptance behavior, realistic speedups, and hidden costs from practices like random-token testing or vocabulary pruning. With SPEED-Bench, researchers and practitioners can finally compare SD methods fairly and make deployment-aware decisions.
Main achievement: Designing and releasing a two-split benchmark plus a unified measurement framework that isolates algorithmic effects and captures real-world serving regimes, exposing insights that older, low-diversity, short-ISL benchmarks missed.
Future directions: Expand domain coverage, add adaptive draft-length policies informed by entropy, integrate more engines and routing strategies (especially for MoE), and explore safeguards for high-entropy domains. Also, refine long-tail stress tests (multilingual, code-mixed, RAG) and investigate adaptive vocabulary pruning that preserves acceptance in rare domains.
Why remember this: SPEED-Bench changes how we judge SD: from lab-only numbers to deployment-ready evidence. It makes evaluations both fair (same inputs across engines) and real (long ISLs, big batches), preventing overly optimistic conclusions and surfacing long-tail failures that matter to users.
Practical Applications
- Choose the right speculative decoding method per domain (e.g., stronger drafters or shorter draft lengths for creative writing).
- Set serving policies that adapt draft length by detected prompt entropy to balance speed and accuracy.
- Benchmark engines (TensorRT-LLM, vLLM, SGLang) fairly using pre-tokenized inputs before committing to a deployment stack.
- Validate that vocabulary pruning does not harm multilingual or RAG workloads in your user base.
- Size infrastructure using Throughput split curves to pick batch sizes that maximize TPS without hurting user latency.
- Detect when workloads become memory-bound and adjust SD parameters or hardware accordingly.
- Replace random-token load tests with SPEED-Bench workloads to avoid inflated throughput claims.
- Compare native MTP heads vs. post-trained drafters for your models to decide on training investments.
- Track acceptance metrics per category to catch regressions in future model or system updates.
- Use ISL buckets to rehearse long-context scenarios (1k–32k) for document chat and codebase assistants.