Diffusion LLMs: Denoising Text Instead of Predicting Tokens
Every language model you've shipped makes a silent bet: the best way to generate a sentence is to commit to each word before writing the next. That bet has driven a decade of progress. It also has a structural flaw that diffusion language models exploit.
Why This Matters
LLaDA 8B (February 2025) matched LLaMA3 8B on standard benchmarks despite a categorically different generation algorithm — iterative unmasking instead of left-to-right sampling. Dream 7B (August 2025) bootstrapped from a pretrained AR model and improved further. By 2026, second-generation diffusion models have addressed the KV-caching gap that made early implementations impractical for production. If you are building anything involving planning, constraint satisfaction, or long-form coherent generation, this design space is worth understanding.
First-Principles Setup
The Autoregressive Commitment Problem
In an autoregressive model, token $x_t$ is conditioned only on $x_{1:t-1}$. The causal mask enforces this: when generating the fourth word, the model cannot see or influence the fifth.
This works well for most text. It becomes a problem when later constraints should influence earlier choices — writing a sonnet that must rhyme in a specific scheme, or generating code that needs a variable name at line 1 to match a function signature at line 50. AR models handle this implicitly by learning to anticipate future constraints from training data. They cannot explicitly revise an earlier token given new information.
The canonical failure mode is the "garden path" problem: the model commits to a prefix that corners it, and the rest of the sequence degrades trying to recover. For complex planning tasks, this matters.
The Discrete Diffusion Alternative
Diffusion models in the image domain corrupt a clean image with Gaussian noise over T steps, then train a denoiser. At inference: start from pure noise, iteratively denoise.
For text, Gaussian noise is not meaningful — tokens are discrete. The natural analogue is masking: replace noise with a [MASK] token.
Forward process: At step t, independently replace each token with [MASK] with probability $t/T$. At $t=T$, the entire sequence is masked.
Reverse process: Train a bidirectional transformer to predict the original tokens from the partially masked sequence.
Inference: Start from a fully masked sequence. At each of T steps, predict all masked positions, then commit the most confident predictions. Repeat until fully unmasked.
This is masked discrete diffusion. LLaDA is its best-known implementation at scale.
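To make the forward process concrete, here is a toy sketch of the corruption step on a made-up six-token sequence (the token ids and the mask id are arbitrary values chosen for the example):

import torch

# Toy illustration of the forward (masking) process.
MASK_TOKEN_ID = 99                              # arbitrary id for this example
x0 = torch.tensor([12, 7, 31, 5, 18, 2])        # clean sequence
for t in (0.25, 0.5, 0.9):
    mask = torch.rand(x0.shape) < t             # each token masked independently with prob t
    xt = torch.where(mask, torch.full_like(x0, MASK_TOKEN_ID), x0)
    print(f"t={t}: {xt.tolist()}")
# t = 1.0 would mask everything; t = 0.0 returns the clean sequence unchanged.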
The Mechanics
LLaDA Training Objective
The LLaDA training loss derives from the ELBO of the masked diffusion process. For a sequence $\mathbf{x} = (x_1, \ldots, x_L)$ and masking ratio $t \in [0, 1]$:
- Sample $t \sim \text{Uniform}(0, 1)$.
- For each token $x_i$, independently replace with [MASK] with probability $t$.
- Pass the masked sequence through a bidirectional transformer (no causal mask).
- Compute cross-entropy only on the masked positions.
$$\mathcal{L} = -\,\mathbb{E}_{t,\, \mathbf{x},\, \tilde{\mathbf{x}}} \left[ \frac{1}{t} \sum_{i:\, \tilde{x}_i = \text{[MASK]}} \log p_\theta(x_i \mid \tilde{\mathbf{x}}) \right]$$
The $1/t$ reweighting matters. A sample with masking ratio $t$ has roughly $tL$ masked tokens, so without the reweighting, heavily-masked examples dominate the summed loss while lightly-masked ones barely register. Dividing by $t$ equalizes the contributions: a step with $t=0.9$ (nearly everything masked) supplies about as much gradient as a step with $t=0.1$ (one token in ten masked). It is also what keeps the objective consistent with the ELBO derivation above rather than an ad-hoc heuristic.
The key architectural implication: no causal mask. The transformer attends bidirectionally over all positions. This is the fundamental break from autoregressive models.
Connection to masked language models. This objective resembles BERT's masked language model (MLM). The crucial difference: BERT uses a fixed masking rate (roughly 15%) and trains for discriminative tasks. LLaDA samples $t$ from the full range $[0, 1]$, learning the complete noise schedule — from near-trivial (a single mask in an otherwise complete sentence) to catastrophic (every token masked). That full-range training is what enables generation: the model has learned to reconstruct text from a blank slate, not just complete nearly-finished sentences.
import torch
import torch.nn.functional as F

def llada_training_step(model, batch, MASK_TOKEN_ID):
    # batch: (B, L) integer token ids
    B, L = batch.shape
    # Sample a masking ratio per sequence, t ~ Uniform(0, 1)
    t = torch.rand(B, device=batch.device)  # (B,)
    # Independently mask each token with probability t
    mask = torch.rand(B, L, device=batch.device) < t.unsqueeze(1)  # (B, L)
    masked_batch = batch.clone()
    masked_batch[mask] = MASK_TOKEN_ID
    # Forward pass: bidirectional transformer, no causal mask
    logits = model(masked_batch)  # (B, L, vocab_size)
    # Cross-entropy only on masked positions
    flat_logits = logits[mask]    # (num_masked, vocab_size)
    flat_targets = batch[mask]    # (num_masked,)
    per_token_loss = F.cross_entropy(flat_logits, flat_targets, reduction="none")
    # Reweight by 1/t (clamped to avoid blow-ups at tiny t), average over batch
    token_weights = (1.0 / t.clamp(min=1e-4)).unsqueeze(1).expand(B, L)[mask]
    loss = (per_token_loss * token_weights).sum() / B
    return loss
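For orientation, a minimal training loop around this step might look like the following. The optimizer choice, learning rate, data source, and mask id are illustrative assumptions, not prescriptions from the LLaDA paper; `model` and `dataloader` are assumed to exist.

# Hypothetical wiring of the training step into a standard PyTorch loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
MASK_TOKEN_ID = 32000  # assumed id for the added [MASK] token

for batch in dataloader:          # batch: (B, L) long tensor of token ids
    loss = llada_training_step(model, batch, MASK_TOKEN_ID)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()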
Inference: Iterative Confidence-Based Unmasking
Inference runs T denoising steps (T=10 to 20 for quality, fewer for speed):
@torch.no_grad()
def llada_generate(model, prompt_ids, gen_len, T=20, MASK_TOKEN_ID=32000):
B = prompt_ids.shape[0]
prompt_len = prompt_ids.shape[1]
# Prompt tokens followed by all-mask generation region
x = torch.cat([
prompt_ids,
torch.full((B, gen_len), MASK_TOKEN_ID, dtype=torch.long, device=prompt_ids.device)
], dim=1)
still_masked = torch.zeros_like(x, dtype=torch.bool)
still_masked[:, prompt_len:] = True
tokens_remaining = gen_len
for step in range(T):
logits = model(x) # (B, L, V)
probs = logits.softmax(-1)
pred_tokens = probs.argmax(-1) # greedy; can sample instead
confidence = probs.max(-1).values # (B, L)
confidence[~still_masked] = -1.0 # ignore already-committed tokens
# Unmask approximately gen_len/T tokens per step
n_unmask = max(1, round(tokens_remaining / max(1, T - step)))
_, topk_idx = confidence.view(B, -1).topk(n_unmask, dim=-1)
for b in range(B):
for idx in topk_idx[b]:
i = idx.item()
if still_masked[b, i]:
x[b, i] = pred_tokens[b, i]
still_masked[b, i] = False
tokens_remaining -= n_unmask
return x[:, prompt_len:]
The schedule for how many tokens to unmask per step is a hyperparameter. Uniform schedules ($\lfloor \text{gen\_len}/T \rfloor$ per step) work. Cosine or confidence-adaptive annealing tends to improve quality at the cost of added complexity.
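For reference, one way to implement a cosine schedule is sketched below; the exact curve is a design choice, not a fixed recipe.

import math

def cosine_unmask_schedule(gen_len, T):
    # Tokens still masked after each step follow a cosine curve: commit few
    # tokens early (when the context is mostly masks), more as it fills in.
    remaining = [round(gen_len * math.cos(math.pi / 2 * s / T)) for s in range(T + 1)]
    per_step = [remaining[s] - remaining[s + 1] for s in range(T)]
    # Rounding can produce zero-commit steps; force at least one per step.
    return [max(1, n) for n in per_step]

Feeding `per_step[step]` in place of the uniform `n_unmask` computation in the loop above is the only change required; the `max(1, ...)` guard can slightly overshoot `gen_len`, which the loop tolerates because already-committed positions are skipped.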
Why Bidirectional Context Matters
Consider the partially masked sequence: "The [MASK] of the [MASK] is critical to quantum entanglement."
An AR model generating left-to-right must guess the first [MASK] before seeing "quantum entanglement." A diffusion model sees the full context at every step. The physics vocabulary in the suffix informs the noun choices in the prefix.
This is not magic. The model only sees the currently-unmasked tokens as context. But as the sequence fills in from high-confidence positions outward, global coherence is built up incrementally instead of being staked on blind early commitments.
Dream 7B: Bootstrapping from an AR Checkpoint
Training LLaDA from scratch is expensive. Dream 7B takes a different path: initialize from a pretrained AR model, then adapt.
An AR model already has strong language representations. The causal mask is the main obstacle. Remove it, add a [MASK] token embedding, and finetune on the masked diffusion objective. The AR weights act as a warm start — attention heads reorganize from causal to bidirectional, but the semantic representations transfer intact.
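A rough sketch of that warm start in plain PyTorch is below. The helper names are illustrative and this is not Dream's actual code; the real recipe also involves learning-rate and annealing choices not shown here.

import torch

def warm_start_from_ar(ar_state_dict, build_bidirectional_model, old_vocab_size):
    # build_bidirectional_model is assumed to construct the same architecture
    # as the AR checkpoint, minus the causal mask, with one extra vocab row
    # for the new [MASK] token.
    model = build_bidirectional_model(old_vocab_size + 1)
    target = model.state_dict()
    with torch.no_grad():
        for name, w in ar_state_dict.items():
            if name not in target:
                continue
            if target[name].shape == w.shape:
                target[name].copy_(w)
            else:
                # Embedding / output matrices: copy the old rows, leave the
                # new [MASK] row randomly initialized.
                target[name][: w.shape[0]].copy_(w)
    model.load_state_dict(target)
    return model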
Dream adds context-adaptive noise scheduling: instead of $t \sim U(0,1)$, the masking ratio adapts to input complexity. Longer, syntactically complex sequences use higher average masking ratios during training, forcing the model to reason over larger gaps. This turned out to be critical — without it, the model failed to generalize on complex reasoning tasks.
The result: Dream 7B exceeds LLaDA 8B trained from scratch at a fraction of the compute, by treating AR pretraining as the expensive first phase of a two-phase recipe.
Second Generation: Solving the KV Cache Problem
The practical bottleneck for early diffusion LLMs: no KV caching.
In an AR model, KV vectors for processed tokens are cached — each new token adds one KV pair. In a pure diffusion LLM, the full sequence is re-processed at every denoising step, including the prompt. With T=20 steps, you run 20 full forward passes over a prompt that an AR model processes once.
For a 1024-token prompt with 256-token generation at T=20: the model computes $(1024 + 256) \times 20 = 25{,}600$ effective token-positions. An AR model computes $1024 + 256 = 1{,}280$. A 20x compute overhead for the same output.
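The same arithmetic as a throwaway helper, useful when sizing a deployment (purely illustrative; it ignores per-layer constants and attention scaling):

def effective_token_positions(prompt_len, gen_len, T):
    # Token-positions processed end to end for one request.
    pure_diffusion = (prompt_len + gen_len) * T   # full sequence re-encoded every step
    autoregressive = prompt_len + gen_len         # each position encoded once (KV cached)
    return pure_diffusion, autoregressive

print(effective_token_positions(1024, 256, 20))   # (25600, 1280) -> 20x overhead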
Second-generation models (SDAR, LLaDA 2.0, Fast-dLLM2) address this with blocked causal context:
- Divide the generation region into blocks of size B (typically 8 to 64 tokens).
- Within the current block: full bidirectional attention (diffusion-style denoising over T steps).
- Across completed blocks: causal attention with KV vectors cached, never re-computed.
Once a block commits, it is frozen and appended to the KV cache exactly like AR tokens. The expensive T-step denoising loop operates only over the current block of B tokens, not the entire sequence.
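A simplified sketch of the blocked decoding loop follows. For clarity, this version re-encodes the committed context on every denoising step instead of reading it from a KV cache, which is exactly the work a real implementation saves; the block structure of the loop is the point.

import torch

@torch.no_grad()
def blocked_generate(model, prompt_ids, gen_len, block_size=32, T=8,
                     MASK_TOKEN_ID=32000):
    x = prompt_ids
    for block_start in range(0, gen_len, block_size):
        cur = min(block_size, gen_len - block_start)
        # Append a fully masked block after everything committed so far.
        block = torch.full((x.shape[0], cur), MASK_TOKEN_ID,
                           dtype=torch.long, device=x.device)
        x = torch.cat([x, block], dim=1)
        still_masked = torch.zeros_like(x, dtype=torch.bool)
        still_masked[:, -cur:] = True
        # Denoise only this block; the prompt and prior blocks stay frozen.
        # (A real implementation would serve them from the KV cache rather
        #  than re-encoding them on each of the T steps.)
        for step in range(T):
            probs = model(x).softmax(-1)
            pred = probs.argmax(-1)
            conf = probs.max(-1).values
            conf[~still_masked] = -1.0
            n = max(1, round(still_masked.sum(-1).max().item() / (T - step)))
            _, idx = conf.topk(n, dim=-1)
            for b in range(x.shape[0]):
                for i in idx[b].tolist():
                    if still_masked[b, i]:
                        x[b, i] = pred[b, i]
                        still_masked[b, i] = False
    return x[:, prompt_ids.shape[1]:]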
flowchart LR
subgraph Pure["Pure Diffusion LLaDA"]
PM["Prompt + all masks"]
PF["Full bidirectional\nattention — T steps\nover entire sequence"]
PM --> PF
end
subgraph Blocked["Blocked Context LLaDA 2.0"]
BPR["Prompt\ncached once"]
BB1["Block 1\nT denoising steps\nover B tokens\ncached after commit"]
BB2["Block 2\nT steps, cached"]
BBN["Block N..."]
BPR -->|causal KV| BB1
BB1 -->|causal KV| BB2
BB2 -->|causal KV| BBN
end
The tradeoff is explicit: global bidirectionality is sacrificed for practical inference speed. Block size B is the tuning dial — larger B preserves more diffusion-style coherence within each chunk, smaller B reduces per-block latency and commits tokens to the KV cache sooner.
Tradeoffs and Failure Modes
Compute per token. Pure diffusion runs T forward passes per generation call. For T=20, the compute is $\mathcal{O}(T \cdot L^2)$ with full attention, versus AR's $\mathcal{O}(L^2)$ cumulative cost for the same sequence. Blocked context reduces this substantially by limiting re-computation to the active block, but the denoising portion still costs T passes.
Error compounding at low T. Confidence-based unmasking makes irreversible commitments. Consider a 64-token generation with T=5: each step commits roughly 13 tokens. If step 1 misidentifies a high-confidence but incorrect token, every subsequent step conditions on that error. With T=20, the model commits 3 to 4 tokens per step and has far more opportunity to read the growing unmasked context before each commit. The quality gap between T=5 and T=20 is significant — unlike speculative decoding, where draft rejection is mathematically exact, there is no correction mechanism once a diffusion token is committed.
Streaming is awkward. AR models produce tokens one-by-one, fitting naturally into streaming UI patterns. Pure diffusion models produce the full sequence at the end. Blocked context partially addresses this — each committed block can be flushed to the client — but within-block latency is still the sum of T forward passes, not a single one.
Local fluency vs. global coherence. Diffusion models produce better global structure at the cost of slightly lower local fluency. A well-trained AR model has internalized precise n-gram statistics; the diffusion model reconstructs local phrasing from bidirectional context at each step. The gap narrows with scale and data volume, but it does not disappear.
Training stability at low masking ratios. As $t \to 0$, almost nothing is masked and gradients become trivial. The $1/t$ reweighting partially corrects this, but in practice clipping $t \geq 0.01$ during training avoids degenerate near-zero batches. Very high masking ratios (t close to 1) can also destabilize early training before the model has learned basic language structure.
Practitioner's Lens
Diffusion LLMs are not drop-in AR replacements today. The inference stack (vLLM, SGLang, TGI) is built entirely around AR, and the best open diffusion models sit one to two generations behind the frontier AR models in absolute benchmark quality.
The clearest near-term use case: structured output generation with hard global constraints. JSON schemas, SQL queries, protocol buffers — cases where a field at position 20 should constrain tokens at position 5. AR models require constrained decoding hacks (grammar-guided sampling, rejection loops); diffusion models handle this natively through bidirectional attention.
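One way to exploit this, sketched under the assumption of a LLaDA-style model and the generate loop above: pre-fill the structurally fixed tokens of a template (braces, keys, SQL keywords) and mask only the free slots, so the model denoises the values while every fixed token is visible as bidirectional context from step one. The function below is a hypothetical illustration, not a documented API.

import torch

@torch.no_grad()
def fill_template(model, template_ids, free_slots, T=10, MASK_TOKEN_ID=32000):
    # template_ids: (1, L) token ids with placeholders at `free_slots`;
    # free_slots: list of positions the model is allowed to fill in.
    x = template_ids.clone()
    still_masked = torch.zeros_like(x, dtype=torch.bool)
    still_masked[0, free_slots] = True
    x[still_masked] = MASK_TOKEN_ID
    remaining = len(free_slots)
    for step in range(T):
        probs = model(x).softmax(-1)
        pred = probs.argmax(-1)
        conf = probs.max(-1).values
        conf[~still_masked] = -1.0      # fixed tokens are never touched
        n = max(1, round(remaining / (T - step)))
        _, idx = conf.topk(n, dim=-1)
        for i in idx[0].tolist():
            if still_masked[0, i]:
                x[0, i] = pred[0, i]
                still_masked[0, i] = False
        remaining -= n
    return x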
Watch the blocked-context architectures specifically. Block size B is a continuous dial between AR latency and diffusion coherence, tunable per task. For agentic pipelines generating tool calls with complex argument interdependencies, even a block size of 32 may deliver measurable quality improvements over pure AR at comparable latency.
The longer bet: diffusion LLMs parallelize token generation within a block. As hardware trends toward wider SIMD and higher memory bandwidth relative to compute, the assumption that sequential AR generation is always fastest will weaken. The architecture that exploits parallel unmasking may age better than the one that does not.
Further Reading
- LLaDA: Large Language Diffusion with mAsking (arxiv, Feb 2025)
- Dream 7B: Diffusion Large Language Models (arxiv, Aug 2025)
- D3PM: Structured Denoising Diffusion in Discrete State Spaces — Austin et al., 2021
- MDLM: Masked Diffusion Language Model (arxiv, 2024)
- Exploring Diffusion LLMs for Software Engineering (arxiv, Oct 2025)