AIO APEX

Speculative Decoding: How AI Models Are Getting Faster Without Getting Bigger

The Speed Bottleneck in Large Language Models

Large language models generate text one token at a time. Each token requires a full forward pass through a model that may have billions of parameters, and those passes must be sequential: you cannot generate token N+1 until you have token N. For a model like GPT-4 or Claude 3, this means inference is fundamentally serial at the token level, making latency proportional to output length. Faster hardware alone does not fix this. At small batch sizes each decoding step is limited by how quickly the model's weights can be read from memory rather than by arithmetic throughput, so much of the GPU's compute sits idle while the architecture forces one step per token. Speculative decoding works around this constraint by changing how much useful work the large model does in each forward pass.

What Speculative Decoding Actually Does

The core idea is deceptively simple: use a small, fast draft model to speculatively generate a sequence of candidate tokens, then use the large verifier model to check all of them in a single parallel forward pass. If the large model agrees with the draft's tokens, you accept them all at once. If it disagrees at position K, you discard the draft's tokens from K onward and take the token for position K from the large model instead.

The critical insight is that verification is not serial in the number of tokens checked: the large model can score K candidate tokens in roughly the same time as it takes to generate a single token. When the draft model is accurate, you get up to K+1 tokens for the price of one large-model forward pass plus K cheap draft steps, since the verifier also supplies the token that follows the last accepted one. When the draft model is inaccurate, you lose some efficiency but never compromise output quality, because the verifier enforces exact alignment with the large model's distribution.
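
The draft-then-verify loop can be sketched in a few lines for the greedy (deterministic) case. This is a minimal illustration, not any library's implementation: the `NextToken` interface, the toy models, and the function names are all hypothetical, and the verification step is written as a Python loop where a real system would use one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a model is a function from a token context
# to its greedy next token.
NextToken = Callable[[List[int]], int]

def toy_target(ctx: List[int]) -> int:
    """Toy stand-in for the large model."""
    return (3 * ctx[-1] + 1) % 7

def toy_draft(ctx: List[int]) -> int:
    """Toy draft: agrees with the target except after token 5."""
    return 0 if ctx[-1] == 5 else toy_target(ctx)

def autoregressive(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out

def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target scores every prefix. A real system does this in a
        #    single batched forward pass; the loop here is for clarity only.
        checks = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n = 0
        while n < k and proposal[n] == checks[n]:
            n += 1
        # 4. Keep the accepted tokens plus the target's own token at the
        #    first mismatch (or the bonus token if everything matched).
        out.extend(proposal[:n])
        out.append(checks[n])
    return out[:len(prompt) + max_new]
```

Because every emitted token is either confirmed or supplied by the target, `speculative_greedy(toy_target, toy_draft, [2], 10)` returns the same sequence as `autoregressive(toy_target, [2], 10)`, only with fewer verification rounds.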

Formally, if the draft model proposes token x at position i with probability q(x), and the target model assigns probability p(x), then the token is accepted with probability min(1, p(x)/q(x)). On rejection, a replacement token is sampled from the corrected distribution max(0, p(x) - q(x)), normalized to sum to 1, and the remaining draft tokens are discarded. This rejection sampling scheme guarantees that the final output distribution is identical to what you would get from the large model running alone: speculative decoding is lossless by construction.
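
The lossless property can be checked numerically on a toy vocabulary. The sketch below assumes both distributions are given as dictionaries over the same small vocabulary; the function names are illustrative, and the residual is taken as max(0, p - q) renormalized, which is the standard form of the correction.

```python
import random
from typing import Dict

Dist = Dict[str, float]

def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    raw = {x: max(0.0, p[x] - q.get(x, 0.0)) for x in p}
    z = sum(raw.values())
    # z == 0 only when p == q, in which case a rejection cannot occur.
    return {x: v / z for x, v in raw.items()} if z > 0 else dict(p)

def speculative_sample(p: Dist, q: Dist, rng: random.Random) -> str:
    """Draw one token: propose from q, accept with probability min(1, p/q)."""
    tokens = list(q)
    x = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
        return x
    r = residual(p, q)
    candidates = list(r)
    return rng.choices(candidates, weights=[r[t] for t in candidates])[0]

def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of speculative_sample, computed analytically."""
    accept = {x: q.get(x, 0.0) * min(1.0, p[x] / q[x]) if q.get(x, 0.0) > 0 else 0.0
              for x in p}
    reject_mass = 1.0 - sum(accept.values())
    r = residual(p, q)
    return {x: accept[x] + reject_mass * r[x] for x in p}

# Illustrative distributions (made up for this example).
p = {"a": 0.5, "b": 0.3, "c": 0.2}   # target model
q = {"a": 0.2, "b": 0.6, "c": 0.2}   # draft model
result = output_distribution(p, q)
```

Here `result` matches `p` up to floating-point error even though the draft distribution `q` is quite different: the acceptance test removes the excess mass the draft puts on "b", and the residual step returns it to "a".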

Draft Models: The Engine Behind the Speedup

The quality of the draft model largely determines the speedup. A draft model that achieves a token acceptance rate (TAR) of 80% on typical inputs can deliver roughly 3–4x speedup on long sequences, depending on how many tokens are drafted per step and how cheap the draft model is relative to the verifier. A TAR of 60% yields 1.5–2x. Below 50%, the overhead of running both models starts to eat into the gains.
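
A back-of-the-envelope model makes the link between acceptance rate and speedup concrete. The sketch below uses the expected-tokens formula from the original speculative decoding analysis and assumes each drafted token is accepted independently with probability `alpha`; `gamma` (tokens drafted per step) and `draft_cost` (draft step time as a fraction of a verifier step) are parameters you would have to measure for your own setup.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verifier pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha, and that the verifier always contributes
    one token of its own (the correction or the bonus token).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Speedup over plain autoregressive decoding under the same assumptions.

    One speculative step costs gamma draft steps plus one verifier pass,
    measured in units of a single verifier pass.
    """
    step_cost = gamma * draft_cost + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

With `alpha = 0.8` and `gamma = 5`, the formula gives about 3.7 tokens per verifier pass; the realized speedup is lower once the draft cost is included, which is why a cheap draft model matters as much as an accurate one.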

Two architectural approaches dominate in practice:

  • Independent small models: A separate model trained on the same data as the large model but at a fraction of the size. For example, using a 7B model as a draft for a 70B verifier. This is the approach used in the original speculative decoding paper by Leviathan et al. (2023) and remains the most widely deployed.
  • Medusa heads: The Medusa architecture (Cai et al., 2024) adds multiple lightweight "heads" directly to the base model's final layer, each predicting tokens at different offsets into the future (position +1, +2, +3, etc.) in a single forward pass. Because Medusa heads share the base model's representations, they can achieve higher acceptance rates than an independent draft model for the same compute cost. Medusa-2 further improves this by jointly fine-tuning the heads with the base model.

A third approach, self-speculative decoding, skips certain layers of the large model during the draft phase and uses the full model for verification. This avoids the need to maintain a separate draft model but requires careful ablation to determine which layers can be safely skipped per domain.

Real-World Adoption: Where Speculative Decoding Is Deployed

Speculative decoding has moved from research to production at several major AI labs and in the main open-source serving stacks. The adoption pattern is telling: in its standard form, it is one of the few inference optimizations that requires no retraining of the target model and introduces no approximation error.

  • Google DeepMind integrated speculative decoding into Gemini's serving infrastructure in 2024, reporting 2x latency improvements on dialogue workloads. Their internal draft models are distilled from the target models, giving them higher TAR than generic small models.
  • SpecInfer, developed by researchers at Carnegie Mellon University and collaborators, extended the idea to tree-based speculation, where the draft model generates a tree of possible continuations rather than a single sequence. The verifier processes the entire tree in one pass, selecting the longest accepted path. This approach tends to outperform single-sequence speculation when the draft model has higher uncertainty.
  • Hugging Face / vLLM / TensorRT-LLM all ship speculative decoding as a first-class serving feature. In vLLM, enabling draft model speculation takes only a couple of configuration options (the draft model and the number of speculative tokens) and works transparently across batch sizes.
  • Apple uses a variant for on-device inference in Apple Intelligence, where the draft model runs on the Neural Engine and the verifier runs on the GPU — exploiting heterogeneous hardware to get both speed and quality.
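
The tree-based verification described in the SpecInfer item above can be illustrated for the greedy case. In this sketch the draft tree is a nested dictionary mapping each candidate token to its subtree, and the target is a toy function; both are assumptions made for the example, and a real system would score all tree nodes in one forward pass using a tree-shaped attention mask rather than calling the model once per level.

```python
from typing import Callable, Dict, List

# Hypothetical representation: token -> subtree of candidate continuations.
Tree = Dict[int, "Tree"]

def toy_target(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next token."""
    return (3 * ctx[-1] + 1) % 7

def longest_accepted_path(target: Callable[[List[int]], int],
                          context: List[int], tree: Tree) -> List[int]:
    """Walk the draft tree, following the branch the target agrees with."""
    accepted: List[int] = []
    node = tree
    while node:
        tok = target(context + accepted)
        if tok not in node:
            break  # no drafted branch matches the target at this depth
        accepted.append(tok)
        node = node[tok]
    return accepted

# A small draft tree with two candidates at the root and two below token 0.
draft_tree: Tree = {0: {1: {5: {}}, 3: {}}, 4: {}}
path = longest_accepted_path(toy_target, [2], draft_tree)
```

Offering several candidates per position raises the chance that one of them matches the target, which is why trees help most when the draft model is unsure.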

Reported production speedups range from 1.5x to 3x depending on output length, domain, and draft model quality. Code generation and structured outputs tend to see the highest acceptance rates because the distribution is more predictable. Open-ended creative text sees lower acceptance rates because the large model's distribution is flatter, making the draft's guesses less reliable.

Token Acceptance Rates and Practical Limitations

The token acceptance rate is not fixed: it varies by domain, prompt, and draft model architecture. Approximate ranges reported on standard benchmarks:

  • Code completion (HumanEval, MBPP): TAR typically 75–85%, speedup 2.5–3.5x
  • Summarization (CNN/DM, XSum): TAR 65–75%, speedup 2–2.5x
  • Open-ended chat: TAR 55–70%, speedup 1.5–2x
  • Translation: TAR 70–80%, speedup 2–3x

The main practical limitations are:

  • Memory overhead: Running two models simultaneously requires holding both in GPU memory. For a 70B verifier, adding a 7B draft consumes roughly 10% more memory — manageable, but a constraint in memory-limited deployments.
  • Batch size scaling: Speculative decoding's advantage diminishes as batch size increases. At batch size 1 (single-user real-time inference), the gains are maximum. At large batch sizes, the large model's GPU utilization is already high and the overhead of running the draft model competes for compute resources.
  • Draft model staleness: If the target model is updated (fine-tuned, RLHF'd), the draft model may diverge in distribution and acceptance rates drop. Maintaining draft-verifier alignment over model updates is a real operational cost.
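
The memory figure in the first item above is simple arithmetic on weight sizes. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only; the draft model's KV cache and activations add to the total and are not modeled here.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1e9 parameters times bytes each)."""
    return params_billion * bytes_per_param

def draft_overhead_fraction(target_params_billion: float,
                            draft_params_billion: float) -> float:
    """Extra weight memory for the draft, as a fraction of the target's.

    Assumes both models are stored at the same precision.
    """
    return draft_params_billion / target_params_billion

target_gb = weight_memory_gb(70)            # 70B verifier at 16-bit precision
draft_gb = weight_memory_gb(7)              # 7B draft at 16-bit precision
overhead = draft_overhead_fraction(70, 7)   # fraction of extra weight memory
```

Quantizing the draft more aggressively than the verifier lowers the overhead further, at the possible cost of a lower acceptance rate.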

Beyond Speculative Decoding: Lookahead and Jacobi Decoding

Two related techniques address some of speculative decoding's limitations, particularly the need for a separate draft model.

Lookahead decoding (developed at LMSYS and integrated into SGLang) decomposes inference into two parallel streams: a lookahead branch that generates n-grams speculatively using Jacobi iteration, and a verification branch that selects correct n-grams from a cache. No draft model is required. Instead, the method exploits the fact that Jacobi iteration over token sequences converges quickly for sequences that appear naturally in the model's training distribution. Lookahead decoding achieves 1.5–2.3x speedup on a single GPU without any additional model weights.

Jacobi decoding is the mathematical foundation underlying lookahead. Instead of the standard sequential decoding loop, it initializes all output positions simultaneously with arbitrary (for example random) tokens and then applies parallel fixed-point iterations until the sequence stops changing. Each iteration updates all positions in parallel using the large model, effectively turning a sequential problem into an iterative one. With greedy decoding, the fixed point is exactly the sequence autoregressive decoding would produce, and it is reached in at most as many iterations as there are positions, because each iteration fixes at least one more leading token. The speedup comes from iterations that fix several tokens at once; how often that happens depends on the model and the text.
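
The fixed-point iteration is easy to see on a toy model. The sketch below is a greedy-only illustration with a hypothetical `toy_target` function; it starts from a constant guess instead of random tokens so the result is reproducible, and it evaluates positions in a Python loop where a real system would use one causal-masked forward pass per iteration.

```python
from typing import Callable, List, Tuple

def toy_target(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next token."""
    return (3 * ctx[-1] + 1) % 7

def autoregressive(target: Callable[[List[int]], int],
                   prompt: List[int], n: int) -> List[int]:
    """Baseline: generate n tokens one at a time."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]

def jacobi_decode(target: Callable[[List[int]], int], prompt: List[int],
                  n: int, init_token: int = 0) -> Tuple[List[int], int]:
    """Iterate all n positions in parallel until nothing changes."""
    guess = [init_token] * n
    iterations = 0
    while True:
        iterations += 1
        # Every position is recomputed from the PREVIOUS iterate, so all n
        # updates are independent and could run in one forward pass.
        new = [target(prompt + guess[:i]) for i in range(n)]
        if new == guess:
            return guess, iterations
        guess = new

tokens, iterations = jacobi_decode(toy_target, [2], 6)
```

The returned `tokens` equal `autoregressive(toy_target, [2], 6)`. On this toy model each iteration fixes only about one new token, so nothing is gained; the method pays off when the model's predictions for later positions happen to be right before their prefixes are final.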

EAGLE-2 (2024) extended the earlier EAGLE method, which drafts with a lightweight head operating on the target model's internal features, by making speculation adaptive: the draft model builds a dynamic draft tree guided by its own confidence scores rather than using a fixed tree shape. EAGLE-2 reportedly achieved 3.5x speedup on LLaMA-3-70B-Instruct, among the highest published numbers for a single-model serving setup at that scale.

In 2026, the focus has shifted to multi-step speculation with consistency guarantees — systems that run 2–3 rounds of speculation per verification step, further increasing the tokens-per-forward-pass ratio without breaking the lossless property. Google's internal Gemini serving stack reportedly uses a three-level cascade: a tiny (1B) model, a medium (8B) model, and the full verifier, where the medium model serves as both a verifier for the tiny model and a draft for the full verifier.

What Engineers Should Do Now

If you are building or operating LLM inference infrastructure, speculative decoding should be on your radar for any latency-sensitive workload. Concrete steps:

  • Evaluate your batch size profile first. If p95 concurrent requests per replica is below 8, speculative decoding will almost certainly help. Above 32, the gains may be marginal and memory overhead may not be worth it.
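  • Use vLLM or SGLang as your starting point. Both ship production-ready speculative decoding. In vLLM, the draft model and the number of speculative tokens have been set with --speculative-model and --num-speculative-tokens; option names have changed between releases, so check the documentation for the version you run. Measure TAR on your actual production traffic before tuning.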
  • For on-device or edge deployments, lookahead decoding is often more practical than maintaining two model files. SGLang's lookahead implementation works without any additional weights.
  • Profile domain-specific TAR. If you are serving a narrow domain (legal, medical, code), a domain-fine-tuned draft model will significantly outperform a generic one. The investment in fine-tuning a 1B–3B draft model often pays back in weeks at scale.
  • Watch the EAGLE-2 and MEDUSA-2 ecosystems. These are moving fast. If your target model is in the LLaMA or Mistral family, community-trained draft heads are already available on Hugging Face and require no training investment.
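
Measuring TAR per domain, as recommended above, only needs two counters per verification step. The sketch below assumes a hypothetical log record with the number of tokens proposed and accepted plus a domain label; the `StepRecord` schema is invented for illustration, and the actual metric names depend on the serving stack.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass
class StepRecord:
    """One verification step from a (hypothetical) serving log."""
    domain: str
    proposed: int   # tokens the draft model proposed in this step
    accepted: int   # how many of those the verifier accepted

def acceptance_rate_by_domain(records: Iterable[StepRecord]) -> Dict[str, float]:
    """Token acceptance rate per domain: accepted tokens / proposed tokens."""
    proposed: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, int] = defaultdict(int)
    for r in records:
        proposed[r.domain] += r.proposed
        accepted[r.domain] += r.accepted
    # Skip domains with no proposals to avoid dividing by zero.
    return {d: accepted[d] / proposed[d] for d in proposed if proposed[d] > 0}

# Made-up sample data, for illustration only.
sample_log = [
    StepRecord("code", 10, 8),
    StepRecord("code", 10, 9),
    StepRecord("chat", 10, 6),
]
rates = acceptance_rate_by_domain(sample_log)
```

Aggregating token counts rather than averaging per-step ratios keeps short steps from being over-weighted, which matters when the number of drafted tokens varies between requests.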

Speculative decoding is mature enough to use in production today and active enough in research that the best implementations a year or two from now will likely look meaningfully different from what exists today. The core principle (verify in parallel, generate speculatively) is here to stay. The draft model architectures and speculation strategies on top of it are still evolving rapidly.
