dLLM: Simple Diffusion Language Modeling

Intermediate
Zhanhui Zhou, Lingjie Chen, Hanghang Tong et al.
2/26/2026
arXiv

Key Summary

  • dLLM is a single, open-source toolbox that standardizes how diffusion language models are trained, run, and tested.
  • It supports two major diffusion styles—Masked Diffusion (MDLM) and Block Diffusion (BD3LM)—and lets you swap them with a one-line change.
  • A plug-and-play Sampler interface makes fast decoding methods (like Fast-dLLM) drop-in, so models run much quicker without rewriting code.
  • A unified evaluation pipeline reproduces official scores and shows that DLM performance is very sensitive to inference settings.
  • With simple recipes, dLLM converts existing BERT encoders and autoregressive LMs into working diffusion chatbots using only supervised finetuning.
  • Released checkpoints and scripts make it practical to build small DLMs from scratch on accessible hardware.
  • Experiments show reasoning SFT helps large DLMs (like LLaDA and Dream) on several benchmarks and that BD3LM can shine on code.
  • A terminal visualizer reveals the non-left-to-right token decoding order unique to diffusion models, aiding debugging and learning.
  • Because components are modular, researchers can extend dLLM with new objectives, samplers, or models while keeping a clean, shared pipeline.

Why This Research Matters

dLLM lowers the barrier to building and evaluating diffusion language models by turning a messy set of scattered tools into a clean, shared toolkit. This means students, startups, and labs can reproduce published results, try faster decoders, and test new ideas without wrestling with incompatible code. Because the evaluation harness matches official settings, comparisons become fair and transparent, avoiding misleading numbers. With simple recipes and released checkpoints, even modest hardware can produce working diffusion chatbots from BERT or small AR backbones. Faster samplers like Fast-dLLM become a one-line swap, making DLMs more practical for real applications. Overall, the framework helps the community move faster together and makes cutting-edge DLM research accessible to many more people.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook) Imagine building a LEGO city with your friends, but everyone uses different bricks, different rules, and different instruction sheets. Even if you all try to make the same house, the results look different and you can’t easily copy each other’s designs.

🄬 Filling (The Actual Concept)

  • What it is: Diffusion Language Models (DLMs) are AI systems that write text by slowly turning a messy sentence into a clear one, step by step.
  • How it works: (1) Start with a sentence where some parts are hidden or noisy; (2) Use the model to guess the hidden pieces; (3) Update the sentence; (4) Repeat until it’s fluent.
  • Why it matters: Without DLMs, we mostly have one-shot, left-to-right writers; DLMs add the power of iterative fixing and parallel updates.

šŸž Bottom Bread (Anchor) For example, to answer a question, a DLM can first fill in easy parts, then circle back to tricky words, improving the answer over several passes.

The World Before: AI text models were mostly autoregressive (AR), like neat typists who write strictly from left to right. Diffusion models offered a fresh superpower—iterative refinement and flexible decoding order—but the research code was scattered. Different labs built similar pieces (training, decoding, evaluation) in different ways. That made reproducing results tough and slowed progress.

šŸž Top Bread (Hook) You know how, in a word-guessing game, you cover some letters and your friend guesses what fits?

🄬 Filling (The Actual Concept)

  • What it is: Masked Diffusion (MDLM) hides random tokens and trains the model to fill them back in across noise levels.
  • How it works: (1) Pick a time t; (2) Mask each token with probability t; (3) Ask the model to predict the original token under that mask; (4) Repeat for many values of t so it learns to recover both light and heavy masking.
  • Why it matters: Without MDLM, you lose a simple, strong way to train DLMs that works across many tasks.

šŸž Bottom Bread (Anchor) It’s like practicing crossword puzzles with different numbers of blanks each time, until you’re good at filling any blank.

šŸž Top Bread (Hook) Imagine reading a book in chapters: you finish one chapter cleanly, then work on the next.

🄬 Filling (The Actual Concept)

  • What it is: Block Diffusion (BD3LM) splits text into blocks; it generates each block with diffusion but moves through blocks in order.
  • How it works: (1) Divide the sequence into blocks; (2) Keep past blocks fixed; (3) Denoise only the current block; (4) Reuse memory (KV cache) from history; (5) Move to the next block.
  • Why it matters: Without BD3LM, you can’t easily reuse context memory like AR models, making long generations slower.

šŸž Bottom Bread (Anchor) It’s like writing a report: lock in finished sections, then polish just the current paragraph.

The Problem: Even though many DLMs used similar training targets (like MDLM) and similar decoding tricks, there was no shared, clean framework. Each paper tweaked code differently. Inference APIs didn’t match, so new samplers were hard to try across models. Evaluation pipelines (datasets, prompts, post-processing, and decoding settings) varied quietly, so numbers weren’t apples-to-apples.

Failed Attempts: People tried copying code from one repo to another, but subtle differences (tokenization, padding rules, EOS handling, or temperature) changed results. Researchers wrote fast decoders, but they were tied to specific models. Evaluations used different harnesses with different defaults, so scores disagreed.

šŸž Top Bread (Hook) Think of a science fair where every experiment uses different rulers and thermometers—hard to compare!

🄬 Filling (The Actual Concept)

  • What it is: A unified Training Pipeline standardizes objectives and data handling.
  • How it works: (1) Pick MDLM or BD3LM; (2) Use the same trainer shell; (3) Plug data collators and small wrappers for special cases; (4) Train.
  • Why it matters: Without it, tiny config differences snowball into big result gaps.

šŸž Bottom Bread (Anchor) Switching from MDLM to BD3LM is like swapping a LEGO piece: one line changes the trainer, not the whole build.

šŸž Top Bread (Hook) Imagine a TV remote that works on every TV.

🄬 Filling (The Actual Concept)

  • What it is: A unified Inference Interface (Sampler) wraps different decoding algorithms behind the same .sample() call.
  • How it works: (1) Keep the model the same; (2) Swap Samplers (vanilla or Fast-dLLM); (3) Press go; (4) Compare speed and quality fairly.
  • Why it matters: Without this, testing a new sampler demands rewriting model code.

šŸž Bottom Bread (Anchor) You can try a fast sampler on LLaDA or Dream without touching their guts—just change which sampler you import.

šŸž Top Bread (Hook) You know how teachers grade with a shared rubric so scores are fair?

🄬 Filling (The Actual Concept)

  • What it is: A unified Evaluation Framework reproduces each model’s official settings but in one place.
  • How it works: (1) Match preprocessing; (2) Match decoding hyperparameters; (3) Match post-processing; (4) Run via one harness.
  • Why it matters: Without this, a tiny change (like temperature or EOS rules) can swing scores a lot.

šŸž Bottom Bread (Anchor) For example, not suppressing the EOS token early can tank results; the framework exposes and controls that switch.

Real Stakes: Students, startups, and labs can now build, measure, and share DLMs faster and more fairly—on reasonable hardware. The paper’s released checkpoints and minimal recipes show that you can turn a BERT or a small AR model into a working diffusion chatbot using only supervised finetuning, no architecture changes. That opens the door for many more people to explore diffusion-style text models.

02Core Idea

šŸž Top Bread (Hook) Picture a universal adapter that lets any plug fit any outlet—no more cable mess.

🄬 Filling (The Actual Concept)

  • What it is: The key insight is to standardize DLM training, inference, and evaluation into modular pieces you can swap like LEGO bricks.
  • How it works: (1) One trainer interface (MDLMTrainer, BD3LMTrainer) for common objectives; (2) One sampler interface to plug in different decoders (including fast ones); (3) One evaluation harness that faithfully matches official settings; (4) Lightweight wrappers for edge cases (like EOS handling) instead of code forks.
  • Why it matters: Without a common backbone, research fragments, results can’t be fairly compared, and good ideas travel slowly.

šŸž Bottom Bread (Anchor) With dLLM, moving from MDLM pretraining to BD3LM or to SFT is a few toggles—not a new codebase.

Three analogies to see it clearly:

  • Kitchen: Same stove (trainer), different pots (objectives), and recipes (configs). You can cook soups (MDLM) or layered bakes (BD3LM) without rebuilding the kitchen.
  • Orchestra: One conductor (framework) cues strings (training), winds (inference), and percussion (evaluation) so the music (results) stays in sync.
  • Factory: A conveyor belt with slots. Swap in a faster robot arm (Fast-dLLM Sampler) or new quality checks (evaluation settings) without stopping the line.

Before vs. After:

  • Before: Each DLM came with its own training loop, its own inference functions, and its own evaluation scripts. Trying a new sampler on an old model meant surgery.
  • After: Trainers and samplers follow one interface; evaluations are reproducible and comparable. Converting BERTs or AR models into DLMs is recipe-driven.

Why it works (intuition, not equations):

  • Separate concerns: keep model architectures, diffusion objectives, and decoding algorithms in their lanes; define narrow, clear interfaces.
  • Reduce footguns: put easy-to-miss details (e.g., attention masks, label padding, BOS/EOS rules, right-shift logits) into small, named wrappers so behavior is explicit.
  • Align evaluation: faithfully mirror original pipelines so comparisons are fair; then let users change knobs knowingly.

Building Blocks (each a small, swappable piece):

  • Trainers: MDLMTrainer and BD3LMTrainer share the same shell but switch the denoising target.
  • Data Collator Wrappers: NoAttentionMaskWrapper, label_pad_token_id rules, PrependBOSWrapper.
  • Samplers: MDLMSampler (vanilla) and MDLMFastdLLMSampler (speed-focused) share a .sample() signature.
  • Visualizer: A terminal tool to watch masks shrink and tokens appear in any order (uniquely DLM).
  • Evaluation Harness: A single entry point that loads per-task settings matching official configs.

šŸž Top Bread (Hook) Think of a racing game where you can swap the engine without changing the car.

🄬 Filling (The Actual Concept)

  • What it is: Fast-dLLM is a speed-boosting sampler that adds KV caching and parallel token updates to MDLM decoding.
  • How it works: (1) Reuse past-computed attention (cache) block-wise; (2) Update multiple confident tokens in one step; (3) Keep accuracy close by bounding uncertainty.
  • Why it matters: Without fast decoding, DLMs can be slow, limiting practical use.

šŸž Bottom Bread (Anchor) In benchmarks like GSM8K or HumanEval, dropping in the fast sampler boosts tokens-per-second by multiple times with small accuracy changes.

In short, the ā€œaha!ā€ is that DLM progress accelerates when the community shares one sturdy, modular toolbox where training, decoding, and testing click together cleanly.

03Methodology

At a high level: Inputs (model + data + config) → Trainer (MDLM or BD3LM) → Checkpoint → Sampler (vanilla or fast) → Evaluation Harness → Results and Visualizations.

Step 1: Set up the environment

  • What happens: Install dLLM and its HuggingFace-based dependencies (transformers, accelerate, peft, deepspeed/FSDP as needed).
  • Why this step exists: It ensures the same reliable backbone across different experiments and scales from small GPUs to multi-GPU clusters.
  • Example: A student with one GPU can use accelerate for single-device training; a lab uses DeepSpeed ZeRO-2 across 8×A100s.

šŸž Top Bread (Hook) Imagine choosing between two practice games: fill-the-blanks or paragraph-by-paragraph.

🄬 Filling (The Actual Concept)

  • What it is: Trainers (MDLMTrainer, BD3LMTrainer) are the standardized training engines.
  • How it works: (1) Point trainer to model, tokenizer, datasets, and data collator; (2) Choose objective (masked or block diffusion); (3) Train, log, and save checkpoints; (4) Swap objectives by changing trainer class.
  • Why it matters: Without consistent trainers, tiny hidden differences (mask schedules, label padding, attention masks) break reproducibility.

šŸž Bottom Bread (Anchor) In code, MDLM pretraining and BD3LM pretraining differ by a one-line trainer swap, not a new script.

Step 2: Choose objective and configure data handling

  • MDLM: Randomly mask tokens per time t and train to predict originals; supports pretraining and SFT.
    • Why needed: It’s the most common, simple path to strong DLMs.
    • Example: Finetune LLaDA or Dream for reasoning with loss only on response tokens.
  • BD3LM: Split the sequence, denoise one block at a time conditioned on clean history.
    • Why needed: Enables KV-cache reuse similar to AR models, often speeding long generations.
    • Example: Use block size 32 at length 512 for instruction SFT.
  • Data Collator Wrappers:
    • NoAttentionMaskWrapper: Keeps padding EOS visible so models learn when to stop.
    • label_pad_token_id=eos_token_id: Trains models to output EOS from extra masks.
    • PrependBOSWrapper: Adds a BOS token for AR-to-MDLM adaptation.
    • right_shift_logits (optional): Reuses next-token prediction; in some AR-to-DLM cases, authors found disabling it gave better results.
  • Why these exist: They make BOS/EOS rules and label padding explicit, avoiding silent mismatches.
  • Example: Turning an AR model (Qwen3-0.6B) into a DLM with SFT only: add PrependBOSWrapper, select MDLM or BD3LM trainer, set lengths and batch sizes, train for 10 epochs.
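
The wrapper pattern is plain decoration: wrap an existing collator and adjust its output. A toy version of the BOS case (the real dLLM wrapper's interface is not shown here and may differ):

```python
class PrependBOSWrapper:
    # Toy version of the wrapper idea: decorate an existing collator so every
    # sequence starts with BOS (needed for AR-to-MDLM adaptation).
    def __init__(self, collator, bos_id):
        self.collator, self.bos_id = collator, bos_id

    def __call__(self, batch):
        return [[self.bos_id] + seq for seq in self.collator(batch)]

identity_collator = lambda batch: [list(seq) for seq in batch]
collate = PrependBOSWrapper(identity_collator, bos_id=1)
print(collate([[5, 6], [7, 8, 9]]))
```

Because the rule lives in one small named class rather than being buried in a training loop, it is visible, testable, and easy to turn off.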

šŸž Top Bread (Hook) Think of a universal Play button that works for any song.

🄬 Filling (The Actual Concept)

  • What it is: The Sampler is a simple .sample() interface that decouples models from decoding algorithms.
  • How it works: (1) Build inputs (e.g., chat templates); (2) Choose MDLMSampler (vanilla) or MDLMFastdLLMSampler; (3) Call sample(); (4) Optionally visualize token histories.
  • Why it matters: Without this, testing a fast sampler means editing model code—a source of bugs.

šŸž Bottom Bread (Anchor) You can swap MDLMSampler for MDLMFastdLLMSampler to accelerate decoding on LLaDA or Dream using the same call.

Step 3: Fast decoding (optional but powerful)

  • Fast-dLLM adds two main tricks:
    • Cache: Block-wise approximate KV caching reuses attention history during block decoding.
    • Parallel: Confidence-based parallel token updates fill multiple masks per step.
  • Why this step exists: DLM inference can be slow; caching and parallelism raise tokens/sec with modest accuracy changes.
  • Example: On HumanEval or GSM8K, Cache & Parallel boosts throughput by multiple times; exact trade-offs depend on max_new_tokens and task.
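
A back-of-envelope sketch (not the real implementation) shows why block-wise caching pays off: without a cache, every block must recompute key/value vectors for the entire history; with it, only the new block's are computed:

```python
def per_block_key_computations(prompt_len, block_size, n_blocks, use_cache):
    # Count key/value vectors computed per block, with vs. without a
    # block-wise KV cache (toy accounting; real costs also involve attention).
    total, history = 0, prompt_len
    for _ in range(n_blocks):
        # cached: only the new block's KVs; uncached: all history plus the block
        total += block_size if use_cache else history + block_size
        history += block_size
    return total

print(per_block_key_computations(64, 32, 8, use_cache=True))
print(per_block_key_computations(64, 32, 8, use_cache=False))
```

The gap widens with sequence length, which is why caching matters most for long generations.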

šŸž Top Bread (Hook) Imagine seeing a puzzle fill itself in, piece by piece, not just the final picture.

🄬 Filling (The Actual Concept)

  • What it is: The Terminal Visualizer shows how masked tokens turn into confident tokens over steps.
  • How it works: (1) Log token states each step; (2) Render them with colors/symbols; (3) Let you inspect decoding order.
  • Why it matters: Without it, you miss unique DLM behavior (non-left-to-right decoding), making debugging harder.

šŸž Bottom Bread (Anchor) When generating code, you can watch identifiers lock in early while tricky logic gets revised over iterations.

Step 4: Unified evaluation

  • What happens: Extend lm-evaluation-harness with per-model, per-task settings that mirror official pipelines (prompts, decoding parameters, post-processing).
  • Why this step exists: DLMs are highly sensitive to inference hyperparameters. Reproducing official numbers needs exact settings.
  • Example: Changing whether EOS is suppressed early or the temperature is 0 vs. >0 can swing accuracy sharply; the framework makes this explicit.

Step 5: Minimal recipes, maximal access

  • BERT → DLM (BERT-Chat): Start from ModernBERT, run MDLM SFT on instruction datasets, loss on response tokens only; no architecture changes.
  • AR → DLM (Tiny-A2D): Start from Qwen3-0.6B, train MDLM or BD3LM with SFT only; release checkpoints; observe BD3LM strengths on coding.
  • Why this matters: Demonstrates that functional DLMs can be made with limited compute and clear, short scripts.

Secret Sauce (what makes this method clever)

  • Interfaces that are just right: narrow, clean trainer and sampler APIs that capture what varies and hide what shouldn’t.
  • Wrappers instead of forks: small utilities for BOS/EOS and masks keep behavior visible and testable.
  • Faithful evaluation: matching official settings ensures apples-to-apples; then the same harness supports fair cross-model comparisons.

04Experiments & Results

The Test: The authors measured three things—(1) whether the unified evaluation reproduces official results; (2) how reasoning SFT affects large open DLMs (LLaDA, Dream); and (3) whether small, accessible recipes can turn BERTs and AR models into capable DLMs. They also tested fast decoding (Fast-dLLM) under the same sampler interface to check speed vs. accuracy trade-offs.

The Competition: For large models, comparisons are against the original LLaDA and Dream numbers. For BERT-Chat, comparisons include GPT-2 sizes and Qwen1.5-0.5B baselines. For AR→DLM conversion, AR baselines are Qwen2.5-0.5B and Qwen3-0.6B (reported by original sources), while the new MDLM/BD3LM variants are trained only with SFT.

Scoreboard with context:

  • Reproduction accuracy: The unified evaluation closely matches official results for LLaDA and Dream across multiple tasks (MMLU, ARC, HellaSwag, GSM8K, HumanEval, MBPP), confirming the pipeline’s faithfulness.
  • Sensitivity spotlight: Figures show that even a single inference hyperparameter (e.g., not suppressing EOS, setting temperature to 0, or parallelizing token updates) can move accuracy a lot—like turning an A− into a C+ just by flipping one decoding switch.
  • Reasoning SFT on large DLMs:
    • LLaDA-Instruct: GSM8K rises from about 79.9 to 80.5 (a small but consistent gain); other reasoning, planning, and coding tasks often improve too.
    • Dream-Instruct: Larger jumps, e.g., GSM8K from ~64.4 to ~70.4; coding and puzzles also see improvements.
    • Base models: Gains on in-distribution math (GSM8K, MATH500) but regressions on out-of-distribution planning and coding, reminding us that SFT can overfit to certain styles of reasoning.
  • BERT → DLM (BERT-Chat):
    • ModernBERT-large-chat beats GPT-2 variants on many benchmarks and even outperforms Qwen1.5-0.5B-Chat on BBH and MATH—despite being encoder-only and needing no architecture changes. That’s like a sprinter winning some events without switching to a runner’s shoes.
    • There’s still a gap to AR decoder-only models on general knowledge tasks (e.g., MMLU, HellaSwag), but the result proves viability.
  • AR → DLM (Tiny-A2D with Qwen3-0.6B):
    • MDLM vs. BD3LM: The BD3LM variant shines on code, surpassing the original AR base model on HumanEval and MBPP, even with SFT only. That’s like a rookie team beating the home team in specific matches.
    • On broader knowledge/reasoning tasks (MMLU, BBH), both DLM variants trail their AR counterparts at this small scale—consistent with expectations.
  • Fast-dLLM speedups: Under identical APIs, Cache and Parallel tricks multiply tokens-per-second by several times across tasks, with modest accuracy shifts that depend on max_new_tokens and the benchmark.

Surprising Findings:

  • How touchy decoding is: Small inference changes (e.g., early EOS handling, temperature, or number of parallel tokens) can swing accuracy widely. This highlights why the unified evaluation (with explicit configs) is essential.
  • Simplicity wins sometimes: For AR→DLM, skipping right-shift logits (a trick used in prior work) actually helped in these SFT-only conversions—evidence that recipes must be tested end-to-end.
  • Encoder-only backbones can chat: ModernBERT-based diffusion chatbots, with no architecture overhauls, are competitive on select tasks—an under-explored but promising path.

05Discussion & Limitations

Limitations:

  • Sensitivity to inference settings: DLMs can swing in performance when you change decoding hyperparameters; careful, documented configs are mandatory.
  • Scale gap on general knowledge: At small scales, converted DLMs often trail AR models on broad benchmarks like MMLU and BBH.
  • Objective coverage: While MDLM and BD3LM cover many popular DLMs, emerging objectives (e.g., edit flows, flow-matching variants) still need standardized trainers.
  • Inference trade-offs: Fast decoding methods can nudge accuracy down on some tasks; tuning for each use case is needed.

Required Resources:

  • GPUs: From a single GPU (for small SFT) to multi-GPU clusters (for large models). DeepSpeed/FSDP enable scaling.
  • Datasets: Instruction-tuning sets (Tulu 3 SFT, SmolTalk, OPC) for SFT; math/coding benchmarks for evaluation.
  • Time: Even simple SFT runs for multiple epochs; evaluation across many tasks adds overhead.

When NOT to Use:

  • If you need pure left-to-right generation with maximum single-step likelihood optimization, and your environment is fully tuned for AR models, they may still be the simpler, more effective choice.
  • If latency must be ultra-low and hardware is optimized for AR caches, a naive DLM decode without Fast-dLLM may be too slow.
  • If your evaluation setup must deviate heavily from official configs and you can’t control hyperparameters, comparisons may be misleading.

Open Questions:

  • RL for DLMs: How best to do reinforcement learning (as with d1 for reasoning) in a diffusion setup, at scale and stably?
  • Larger-scale parity: At what sizes and recipes do DLMs match or beat AR models on broad knowledge tasks?
  • Better fast decoding: Can we push caching/parallelism further without accuracy dips, or adaptively choose speed/quality per task?
  • Unified objectives: Can we bridge MDLM, BD3LM, and edit flows under one trainer with smart schedules?
  • Interpretability: How can we best leverage the non-left-to-right decoding order to understand and guide reasoning?

06Conclusion & Future Work

Three-Sentence Summary: dLLM is a modular, open-source framework that unifies training, inference, and evaluation for diffusion language models, so researchers can reproduce, compare, and extend ideas cleanly. With simple recipes and released checkpoints, it shows that BERT encoders and autoregressive models can be converted into practical DLMs using only supervised finetuning. A plug-and-play sampler interface supports fast decoding (e.g., Fast-dLLM), while a faithful evaluation harness reveals—and controls—the strong sensitivity of DLMs to inference choices.

Main Achievement: Turning a fragmented landscape into a single, standardized pipeline that covers the most-used diffusion objectives (MDLM, BD3LM), makes fast samplers drop-in, and reproduces official results across popular models and tasks.

Future Directions: Expand trainers to new objectives (e.g., edit flows, flow-matching), add RL-based post-training for reasoning, broaden model coverage, and further optimize decoding speed/quality trade-offs. Continue releasing small-model checkpoints and scripts so more people can participate with modest hardware.

Why Remember This: dLLM gives the community a common language and set of tools for diffusion LMs—like moving from a dozen dialects to one shared playbook—so good ideas spread faster, results are fairer, and building new DLMs becomes accessible to many, not just a few.

Practical Applications

  • Reproduce LLaDA and Dream results in a single pipeline to validate prior work before extending it.
  • Finetune open DLMs on reasoning datasets (e.g., s1K) using MDLMTrainer to boost math or coding performance.
  • Convert an existing BERT encoder into a functional diffusion chatbot (BERT-Chat) with MDLM SFT and no architecture changes.
  • Adapt a small autoregressive model (e.g., Qwen3-0.6B) into MDLM or BD3LM for parallel decoding and iterative refinement.
  • Speed up deployed DLM inference by swapping in MDLMFastdLLMSampler to reach higher throughput under latency constraints.
  • Debug DLM behavior with the terminal visualizer, inspecting token update order to diagnose failures or improve prompts.
  • Run fair, apples-to-apples benchmark comparisons by using the unified evaluation harness that mirrors official settings.
  • Prototype new diffusion objectives (e.g., edit flows) by adding a minimal trainer module without rebuilding the pipeline.
  • Perform parameter-efficient finetuning (LoRA) on limited hardware by leveraging HuggingFace accelerate and DeepSpeed.
  • Create teaching labs where students follow the same recipes to build, evaluate, and compare small DLMs on one GPU.
#diffusion language models #masked diffusion #block diffusion #MDLM #BD3LM #dLLM framework #Fast-dLLM #sampler interface #trainer interface #evaluation harness #KV cache #instruction tuning #LoRA #HuggingFace #autoregressive to diffusion conversion