Standard speculative decoding is exact but brittle: a single rejection cascades into wasted computation and a hard restart. A paper from April 2026 replaces rejection sampling with Sequential Monte Carlo and reports a further 2.36× speedup over already-fast spec decode. Here is the mechanism from first principles and what it costs you.
Why Memory Bandwidth Is the Bottleneck
Autoregressive generation is memory-bandwidth-bound, not compute-bound. When you run a 70B model to generate one next token, you stream all 70B parameter values from GPU HBM, do a comparatively small matrix-vector multiply per layer, and produce one token. The ratio of FLOPs executed to bytes moved is low, so most of the arithmetic units sit idle while the memory bus runs at full throttle.
This means latency per token is proportional to model size, not to batch size, up to the point where your batch is large enough to saturate the compute pipeline. For a single-user inference request, that batch is size 1 and you are firmly in the memory-bound regime.
Speculative decoding exploits this asymmetry. A small draft model proposes K tokens cheaply (few parameters, low memory traffic). The large target model then evaluates all K candidates in a single forward pass — and here is the key: evaluating K tokens in a batch costs almost the same memory bandwidth as evaluating one, because you still load the same target model weights exactly once. If the draft is accurate, you have produced K tokens for the I/O cost of a single target pass.
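A back-of-the-envelope estimate makes the asymmetry concrete. The sketch below assumes FP16 weights (2 bytes per parameter), an illustrative HBM bandwidth of 3 TB/s, and a hypothetical 1B draft model. All three are assumptions chosen for the arithmetic, not measurements, and the estimate ignores KV cache traffic and compute time.

```python
def min_ms_per_pass(n_params, bytes_per_param=2, hbm_bytes_per_s=3e12):
    """Lower bound on one forward pass: the time to stream every weight once."""
    return n_params * bytes_per_param / hbm_bytes_per_s * 1e3

K = 8
target_ms = min_ms_per_pass(70e9)  # 70B target model
draft_ms = min_ms_per_pass(1e9)    # hypothetical 1B draft model (assumption)

# One speculative round: K sequential draft steps plus one target verification pass.
round_ms = K * draft_ms + target_ms

print(f"autoregressive: {target_ms:.1f} ms per token")
print(f"speculative round: {round_ms:.1f} ms for up to {K} drafted tokens")
```

Under these assumptions the round costs only slightly more than a single target pass, so the payoff depends almost entirely on how many drafted tokens survive verification.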
Standard Spec Decode and Why It Breaks Down
The original spec decode formulation uses rejection sampling. The algorithm guarantees the output distribution is exactly p_target — the target model's true next-token distribution. For each draft token x at position t:
accept with probability: min(1, p_target(x | ctx) / p_draft(x | ctx))
if rejected: discard tokens t..K, sample a correction from max(0, p_target - p_draft) / Z
The guarantee is clean. If the draft model perfectly matches the target, you accept every token. If the draft is garbage, you accept almost nothing and waste the computation that produced all tokens beyond the rejection point.
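The accept/reject rule can be sketched in a few lines. This is a minimal illustration, assuming each position's distribution is given as a plain dict from token to probability; the function name `verify_draft` and that data layout are hypothetical, and the bonus token that is sampled from the target when every draft token is accepted is omitted for brevity.

```python
import random

def verify_draft(draft_tokens, p_draft, p_target, rng=random):
    """Verify K draft tokens with rejection sampling.

    p_draft[t] and p_target[t] map token -> probability at position t.
    Returns the accepted prefix, plus one correction token if a rejection occurred.
    """
    out = []
    for t, x in enumerate(draft_tokens):
        ratio = p_target[t].get(x, 0.0) / p_draft[t][x]
        if rng.random() < min(1.0, ratio):
            out.append(x)  # accepted: keep the draft token
            continue
        # Rejected: sample from the residual max(0, p_target - p_draft).
        # rng.choices normalizes the weights, which plays the role of dividing by Z.
        vocab = sorted(set(p_target[t]) | set(p_draft[t]))
        residual = [max(0.0, p_target[t].get(v, 0.0) - p_draft[t].get(v, 0.0)) for v in vocab]
        out.append(rng.choices(vocab, weights=residual, k=1)[0])
        break  # everything after the rejection point is discarded
    return out
```

A rejection only happens when the target assigns the token less probability than the draft did, so the residual always has positive mass somewhere and the correction sample is well defined.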
Acceptance rate α is the critical quantity. The expected accepted tokens per target forward pass is:
E[accepted] = (1 - α^(K+1)) / (1 - α)
With α = 0.9 and K = 8: E[accepted] ≈ 6.1 tokens per pass. Excellent. With α = 0.5 and K = 8: E[accepted] ≈ 2.0 tokens per pass. You have just paid for an expensive coin flip.
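The formula is easy to evaluate for other settings. A small sketch, assuming (as the formula does) that every draft token is accepted independently with the same probability α; the function name is illustrative.

```python
def expected_tokens_per_pass(alpha, K):
    """Evaluate (1 - alpha^(K+1)) / (1 - alpha) for a fixed acceptance rate."""
    if alpha >= 1.0:
        return float(K + 1)  # limit of the formula as alpha -> 1
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for alpha in (0.4, 0.5, 0.75, 0.8, 0.9):
    print(f"alpha={alpha:.2f}  K=8  tokens/pass={expected_tokens_per_pass(alpha, 8):.2f}")
```

Because the numerator saturates quickly, raising K helps little once α is low: at α = 0.5 the value is already within a rounding error of its ceiling of 2.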
Acceptance rate varies heavily by token type. Common syntax, connectives, repeated phrases — the draft nails these. Novel technical terms, multi-step reasoning chains, domain-specific numbers — the draft diverges from target and α craters. A draft model that averages α = 0.8 across all tokens might run at α = 0.4 on the hard tokens that matter most. The 2604.14682 paper on acceptance dynamics confirms this: task type is a stronger predictor of acceptance than tree depth, and only the chat domain consistently yields expected accepted length above 1.0.
Rejection is also wasteful in a specific way. When you reject at position 3 of a 10-token draft, you keep tokens 1–2, replace token 3 with a correction sample, and discard tokens 4–10 entirely. You ran the draft model for 10 steps, paid for a target model verification pass, and got 3 tokens. All the effort that went into predicting tokens 4–10 evaporates.
Sequential Monte Carlo: The Core Idea
SMC is a class of algorithms from Bayesian statistics, developed for nonlinear filtering problems — tracking moving objects from noisy radar, robot localization, option pricing under stochastic dynamics. The goal: approximate a target distribution p*(x) when you cannot sample from it directly but can evaluate it (or its unnormalized density) at any candidate point.
The approach: maintain a population of N particles (candidate samples). At each step:
- Propagate each particle forward using a cheap proposal distribution q(x).
- Compute importance weights w_i = p*(x_i) / q(x_i) for each particle.
- If the effective sample size (ESS) drops below a threshold, resample: copy high-weight particles, prune low-weight ones, and reset weights to uniform.
Over time, the population drifts toward p*. The approximation error shrinks as O(1/√N): quadruple the particles to halve the error.
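The error scaling can be checked on a toy problem. The sketch below uses self-normalized importance sampling, the weighting step of SMC without resampling, on a four-outcome distribution. The distributions, the test function f(x) = x, and the helper names are all illustrative assumptions.

```python
import math
import random

P_TARGET = [0.5, 0.1, 0.3, 0.1]    # toy target p*
Q_PROPOSAL = [0.4, 0.3, 0.2, 0.1]  # toy proposal q

def snis_mean(n, rng):
    """Self-normalized importance sampling estimate of E_p*[x] from n draws of q."""
    xs = rng.choices(range(4), weights=Q_PROPOSAL, k=n)
    ws = [P_TARGET[x] / Q_PROPOSAL[x] for x in xs]
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def rmse(n, trials, rng):
    """Root-mean-square error of the estimator over repeated trials."""
    truth = sum(p * x for x, p in enumerate(P_TARGET))
    return math.sqrt(sum((snis_mean(n, rng) - truth) ** 2 for _ in range(trials)) / trials)

rng = random.Random(0)
r100 = rmse(100, 400, rng)
r400 = rmse(400, 400, rng)
print(f"RMSE at N=100: {r100:.4f}, at N=400: {r400:.4f}, ratio: {r100 / r400:.2f}")
```

With four times as many particles the error ratio should land near 2, up to sampling noise in the RMSE estimate itself.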
The mapping to speculative decoding is direct:
- Target distribution: the target model's next-token probabilities p_target(x | ctx)
- Proposal distribution: the draft model's next-token probabilities p_draft(x | ctx)
- Particles: N candidate draft sequences extended in parallel
- Importance weight at each step: p_target(x_i | ctx) / p_draft(x_i | ctx)
Instead of binary accept/reject, you do soft resampling. Between resampling steps, every particle stays alive and simply carries its weight. When resampling triggers, high-weight particles get copied more often and low-weight particles less often, sometimes not at all. The population as a whole concentrates on what the target model prefers.
The Algorithm
import itertools
import math
import random

def softmax(log_weights):
    # Normalize log-weights to probabilities; subtracting the max avoids overflow.
    m = max(log_weights)
    exps = [math.exp(lw - m) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_sample(log_weights):
    # Draw one index with probability proportional to exp(log_weight).
    return random.choices(range(len(log_weights)), weights=softmax(log_weights), k=1)[0]

def smc_speculative_decode(draft_model, target_model, context, K=8, N=16):
    """
    K: tokens to speculate per draft round
    N: number of parallel particles
    Returns one sampled continuation from the approximate target distribution.
    draft_model and target_model are placeholder interfaces, not a real library API.
    """
    particles = [list(context) for _ in range(N)]
    log_weights = [0.0] * N
    for step in range(K):
        # Each particle proposes one token from the draft model
        draft_tokens, draft_log_probs = [], []
        for i in range(N):
            tok = draft_model.sample(particles[i])
            lp = draft_model.log_prob(particles[i], tok)
            draft_tokens.append(tok)
            draft_log_probs.append(lp)
        # Target model evaluates all N proposals in ONE batched forward pass.
        # Memory cost: load target weights once, run N token evaluations.
        target_log_probs = target_model.log_prob_batch(particles, draft_tokens)
        # Accumulate log importance weights
        for i in range(N):
            log_weights[i] += target_log_probs[i] - draft_log_probs[i]
            particles[i] = particles[i] + [draft_tokens[i]]
        # Resample when ESS drops below N/2
        ess = compute_ess(log_weights)
        if ess < N / 2.0:
            indices = systematic_resample(log_weights, N)
            particles = [list(particles[j]) for j in indices]
            log_weights = [0.0] * N
    # Select one particle proportional to final accumulated weights
    final_idx = categorical_sample(log_weights)
    return particles[final_idx]

def compute_ess(log_weights):
    # ESS = 1/(sum of squared normalized weights)
    # ESS = N when all weights are equal; ESS = 1 when all weight is on one particle
    w = softmax(log_weights)
    return 1.0 / sum(wi**2 for wi in w)

def systematic_resample(log_weights, N):
    # One uniform draw u determines all N resample indices.
    # Lower variance than N independent multinomial draws.
    w = softmax(log_weights)
    cumulative = list(itertools.accumulate(w))
    u = random.uniform(0.0, 1.0 / N)
    indices, j = [], 0
    for i in range(N):
        threshold = u + i / N
        # The j < len(w) - 1 guard protects against float round-off in the last cumulative sum.
        while j < len(w) - 1 and cumulative[j] < threshold:
            j += 1
        indices.append(j)
    return indices
The critical line is target_model.log_prob_batch(particles, draft_tokens). The target model runs one forward pass over all N particles simultaneously. Because LLM inference is memory-bandwidth-bound, batch size N is nearly free compared to loading the model weights — you get N importance weights for the cost of one model load.
This is the fundamental difference from standard spec decode: instead of running the expensive target model once and getting one binary outcome, you get N soft outcomes. Every target forward pass is fully utilized.
A Numerical Example
Four particles (N=4), one draft step. The draft model proposes a different token for each particle:
| Particle | Token | p_draft | p_target | w = p_t/p_d | w_norm |
|---|---|---|---|---|---|
| 0 | "the" | 0.40 | 0.50 | 1.25 | 0.306 |
| 1 | "a" | 0.30 | 0.10 | 0.33 | 0.082 |
| 2 | "some" | 0.20 | 0.30 | 1.50 | 0.367 |
| 3 | "any" | 0.10 | 0.10 | 1.00 | 0.245 |
The raw weights sum to 4.083, and w_norm divides each weight by that sum. ESS = 1 / Σw_norm² = 1 / (0.0937 + 0.0067 + 0.1349 + 0.0600) ≈ 3.39.
Threshold is N/2 = 2. ESS = 3.39 > 2, so no resampling this step. Particle 1 (which proposed "a", a low-probability target token) survives with reduced weight.
In standard spec decode, particle 1 would be rejected with probability 1 − 0.33 = 0.67. A rejection here discards all downstream computation and forces a correction sample. SMC keeps particle 1 alive at reduced influence — it might still contribute useful diversity in later steps.
If ESS had dropped below 2 (say, if particle 2 had weight 0.90 and the rest had 0.033 each), systematic resampling would copy particle 2 three or four times (3.6 in expectation) and prune most of the others. After resampling, all log-weights reset to zero, which makes the weights uniform. The population is now more concentrated near "some", which the target model prefers, but the copies can diverge in subsequent steps as the draft model explores from that point.
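The collapsed-weight scenario can be checked numerically. The sketch below takes already-normalized weights (0.90 for particle 2 and exactly 1/30 for each of the others, so they sum to 1) and sweeps the systematic-resampling offset u across its whole range to count how many copies particle 2 receives. The helper names are illustrative.

```python
from collections import Counter
from itertools import accumulate

def ess(weights):
    """Effective sample size of normalized weights."""
    return 1.0 / sum(w * w for w in weights)

def systematic_indices(weights, u):
    """Systematic resampling of len(weights) indices for a given offset u in [0, 1/N)."""
    n = len(weights)
    cumulative = list(accumulate(weights))
    indices, j = [], 0
    for i in range(n):
        threshold = u + i / n
        while j < n - 1 and cumulative[j] < threshold:
            j += 1
        indices.append(j)
    return indices

weights = [1 / 30, 1 / 30, 0.9, 1 / 30]  # particle 2 dominates
print(f"ESS = {ess(weights):.2f}")        # well below the N/2 = 2 threshold

# Sweep u over an even grid in (0, 1/4) and count copies of particle 2.
copies = [Counter(systematic_indices(weights, (k + 0.5) / 4000))[2] for k in range(1000)]
mean_copies = sum(copies) / len(copies)
print(f"copies of particle 2: min={min(copies)}, max={max(copies)}, mean={mean_copies:.2f}")
```

Particle 2 always receives either three or four of the four slots, which is the low-variance behavior that motivates systematic resampling over independent multinomial draws.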
The Speedup Arithmetic
With N=16 particles and K=8 draft steps, each round scores 16 × 8 = 128 candidate tokens. If the serving stack batches those into a single target forward pass (the sketch above instead calls the target once per draft step for clarity), loading the 70B target model from HBM once and scoring 128 tokens costs barely more than scoring 1. The memory bandwidth dominates; the arithmetic is nearly free at this batch size.
Standard spec decode with K=8 and α=0.75 produces E[accepted] ≈ (1 − 0.75^9)/(0.25) ≈ 3.7 tokens per target pass. SMC-SD with N=16 particles produces an expected accepted length of roughly 5–6 tokens per equivalent target pass, because the soft-resampling mechanism wastes far fewer verification opportunities. The paper reports 2.36× speedup over standard spec decode and 5.2× over pure autoregressive, while staying within 3% of the target model's accuracy on reasoning, instruction-following, and coding benchmarks.
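Tokens per target pass is not yet wall-clock speedup, because the draft steps also take time. For the standard spec decode baseline, the cost model in Leviathan et al. divides the expected tokens per pass by (K·c + 1), where c is the cost of one draft step relative to one target pass. The sketch below applies it; the values of c are illustrative assumptions, and the model does not cover SMC-SD itself.

```python
def spec_decode_speedup(alpha, K, c):
    """Expected wall-clock speedup of standard spec decode over autoregressive decoding.

    alpha: per-token acceptance rate, K: draft length,
    c: cost of one draft step divided by the cost of one target pass.
    """
    expected_tokens = (1 - alpha ** (K + 1)) / (1 - alpha)
    return expected_tokens / (K * c + 1)

for c in (0.0, 0.02, 0.05, 0.10):  # illustrative draft/target cost ratios
    print(f"c={c:.2f}  speedup={spec_decode_speedup(0.75, 8, c):.2f}x")
```

Even a draft model that costs 5% of a target pass per step trims the baseline's gain noticeably at K=8, which is worth keeping in mind when comparing reported speedups.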
Tradeoffs and Failure Modes
The core cost is approximation. Standard spec decode is provably exact: its output distribution matches p_target. SMC-SD samples from an approximation whose KL divergence from p_target shrinks with N but never reaches zero for finite N. Whether the reported accuracy gap of up to 3% is acceptable depends on the application. For a chat assistant, it is likely invisible. For formal verification or anything where token-level distributions matter, it is a real concern.
Memory is the second cost. Each particle maintains its own KV cache for the draft context. With N=16 and K=8, you hold 16 separate draft sequences in flight. For a 7B draft model with 32 layers in FP16, and assuming 32 attention heads of dimension 128 with no grouped-query attention, each token's KV cache is roughly 32 × 2 × 4096 × 2 bytes ≈ 512KB. Sixteen particles × 8 draft steps = 128 active token slots × 512KB = about 64MB for the speculated tokens alone, on top of the cached context each particle needs. Larger draft models (13B, 30B) push this up further. At some point, this evicts useful target KV cache and hurts overall throughput.
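The per-token KV footprint depends heavily on how many key/value heads the draft model keeps, so it helps to parameterize the estimate. A minimal sketch: the two shapes below are illustrative assumptions (a single 128-dimensional KV head versus 32 full heads, roughly a Llama-style 7B without grouped-query attention), and the function names are hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token. The factor 2 covers one key and one value per layer."""
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem

def speculated_kv_bytes(n_particles, k_steps, **shape):
    """KV cache bytes for the speculated tokens only (shared context not included)."""
    return n_particles * k_steps * kv_bytes_per_token(**shape)

KIB, MIB = 1024, 1024 ** 2
single_head = dict(n_layers=32, n_kv_heads=1, head_dim=128)  # assumption
full_mha = dict(n_layers=32, n_kv_heads=32, head_dim=128)    # assumption

for name, shape in (("1 KV head", single_head), ("32 KV heads", full_mha)):
    per_token = kv_bytes_per_token(**shape) / KIB
    total = speculated_kv_bytes(16, 8, **shape) / MIB
    print(f"{name}: {per_token:.0f} KiB per token, {total:.0f} MiB for N=16, K=8")
```

The 32× gap between the two shapes shows why the real head configuration of the draft model, and whether it uses grouped-query attention, should be checked before budgeting memory for a given N.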
Resampling frequency is the third variable. If draft quality is consistently high, ESS stays high, resampling is rare, and SMC adds almost no overhead. If draft quality is terrible on every step — a badly mismatched domain, for example — ESS collapses immediately on each step, resampling happens constantly, particle diversity stays low, and you recover something close to standard spec decode behavior at higher memory cost. SMC-SD is not magic for bad draft models; it is a precise improvement for moderate draft quality.
The ESS threshold (N/2 by default) is a hyperparameter worth tuning. A lower threshold means less frequent resampling, more diverse particles, higher approximation error. A higher threshold means more aggressive resampling, lower approximation error, more computation spent on resampling overhead.
Architecture: Standard vs. SMC
flowchart LR
subgraph Standard["Standard Spec Decode"]
A1[Draft: generate K tokens] --> B1[Target: verify K tokens]
B1 --> C1{Accept each token?}
C1 -- Accept --> D1[Emit token]
C1 -- Reject --> E1[Discard tail, resample correction]
E1 --> A1
end
subgraph SMC["SMC-SD"]
A2[N particles: each draft 1 token] --> B2[Target: batch-eval N proposals]
B2 --> C2[Compute importance weights]
C2 --> D2{ESS below N/2?}
D2 -- Yes --> E2[Systematic resample N particles]
D2 -- No --> F2[Next draft step]
E2 --> F2
F2 --> G2{K steps done?}
G2 -- No --> A2
G2 -- Yes --> H2[Sample 1 particle by final weight]
end
Standard spec decode's reject branch discards the draft tail and restarts, so the computation behind those tokens is lost. SMC's resampling redistributes weight instead: low-weight particles are pruned, but the target scores computed for them have already shaped which particles survive, so that work is not wasted.
Practitioner's Lens
If you run speculative decoding in production today — with vLLM, TGI, or a custom serving stack — SMC-SD is a drop-in change at the sampling loop level. It requires no model surgery, no retraining, no architecture change. The draft-and-verify infrastructure you already have is exactly what SMC-SD uses.
The particle count N is your primary knob. Start at N=8 to validate approximation error across your task distribution before pushing to N=16 or N=32 as memory budget allows. Monitor task-specific accuracy, not just aggregate metrics — distributional shift from the approximation tends to concentrate on the hardest subtasks where the draft model diverges most from target.
SMC-SD pays off most for deployments where the target model is very large (frontier-scale, 70B+) and the draft model is a distilled sibling with moderate accuracy. In that regime, standard spec decode's acceptance rate is often in the 0.6–0.75 range — good but not great. SMC recovers efficiency from precisely those modestly-bad draft steps that rejection sampling handles worst, by keeping divergent particles alive at reduced weight rather than throwing them away entirely.
One practical concern: the KV cache memory for N parallel draft sequences can evict useful target KV cache on long-context requests. Profile your memory footprint at the N you intend to deploy before committing to it in production.
Further Reading
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2022): the original; §2 has the rejection sampling derivation that SMC-SD replaces.
- Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling" (2023): DeepMind's concurrent derivation with slightly different framing of the acceptance criterion.
- "Faster LLM Inference via Sequential Monte Carlo" (arXiv 2604.15672): the primary paper this post covers.
- "Acceptance Dynamics Across Cognitive Domains in Speculative Decoding" (arXiv 2604.14682): empirical breakdown of why acceptance rate varies by domain; motivates SMC's flexibility.
- Doucet & Johansen, "A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later" (2011): the canonical SMC reference; §2–3 map directly to SMC-SD's resampling mechanics.