How Speculative Decoding Makes AI 4× Faster

You know that moment when you ask ChatGPT something and watch the response appear word by word… word by word… word by word? There is a reason for that glacial pace, and it is not laziness on the model’s part. Every single token it generates requires a full forward pass through billions of parameters. One token. One forward pass. Repeat until done.

This is where most people accept the slowness as inevitable. But researchers at Google and DeepMind said no. They invented a technique called speculative decoding, and it fundamentally changes how LLMs produce text at inference time. Instead of generating one token per forward pass, the system drafts multiple tokens ahead, verifies them in parallel, and keeps the ones the big model agrees with. It sounds simple until you realize the math: you can go from 1 token per forward pass to 2, 3, or even 4 tokens per pass. The result is 2 to 4 times faster inference. No retraining. No parameter changes. Just speed.

Let me explain how this actually works, because the engineering is more beautiful than it initially appears.

Why Token Generation Is So Painfully Slow

Large language models generate text through an autoregressive decoding process. This means they predict one token at a time, and each prediction requires a complete pass through the entire network. Think of it like a chef who can only prepare one dish at a time, and each dish requires checking every ingredient in the warehouse before cooking.

When you prompt Claude or ChatGPT, here is what happens under the hood. The model reads your prompt, tokenizes it into numbers the network understands, and produces a probability for every possible next token. It picks the most likely one (or samples from the distribution). Then it starts over. The new input is the prompt plus all previously generated tokens. New forward pass. New probabilities. Repeat.
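In code, that loop looks something like the sketch below. This is a minimal illustration, not any particular library's API: the `model.forward` and `tokenizer` methods are hypothetical stand-ins for whatever your stack actually provides.

```python
import numpy as np

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50):
    """Plain autoregressive decoding: one full forward pass per token."""
    tokens = tokenizer.encode(prompt)          # prompt -> list of token IDs
    for _ in range(max_new_tokens):
        probs = model.forward(tokens)          # a FULL pass over the network...
        next_token = int(np.argmax(probs))     # ...to pick ONE token (greedy)
        tokens.append(next_token)              # the history grows, then repeat
        if next_token == tokenizer.eos_id:
            break
    return tokenizer.decode(tokens)
```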

This is necessary because language is sequential and contextual. You cannot know what the fifth word should be without knowing the first four. The model genuinely needs the entire history to make accurate predictions.

But here is the inefficiency. While the model is computing those probabilities for token number 50, a human could already guess what tokens 51, 52, and 53 might be based on context and patterns. The model is being conservative and careful when it could afford to speculate.

Speculative decoding weaponises this insight.

The Draft and Verify Dance

The core idea splits the problem into two stages. First, a smaller draft model rapidly generates candidate tokens. Then, the large target model verifies them all in parallel. If the draft model guesses correctly, you get those tokens for nearly free. If it guesses wrong, you discard the incorrect speculations and resume from the last verified position.

The genius part is the mathematics. Verification is far cheaper per token than generation, but not for the reason you might guess. Autoregressive decoding is memory-bandwidth bound: most of each forward pass goes to streaming billions of weights out of GPU memory, while the arithmetic for a single position leaves the hardware mostly idle. When the target model checks four draft positions in one pass, it reuses that same weight traffic, so the pass costs only slightly more than generating a single token the normal way. That is where the speedup hides.

In practice, speculative decoding works like this. You have Claude 3.5 Sonnet (the big target model) and Claude 3 Haiku (the small draft model). Haiku runs quickly and generates, say, 4 speculative tokens. Sonnet takes those 4 tokens and checks them in a single forward pass, computing its own probability distribution for every position in parallel. Draft tokens consistent with the target distribution are accepted. At the first token that is not, the rest of the draft is thrown away, the target model substitutes its own token at that position, and generation resumes from there, as sketched below.
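Here is a sketch of one draft-and-verify round, shown for the simple greedy-decoding case where a draft token counts as correct if it is exactly the token the big model would have picked anyway. The `draft_model` and `target_model` objects, and their `forward`/`forward_all` methods, are hypothetical stand-ins rather than a real API.

```python
import numpy as np

def speculative_step(target_model, draft_model, tokens, k=4):
    """One draft-and-verify round (greedy decoding). Returns the new tokens."""
    # 1. Draft: the small model proposes k tokens, one cheap pass each.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        next_tok = int(np.argmax(draft_model.forward(ctx)))
        draft.append(next_tok)
        ctx.append(next_tok)

    # 2. Verify: ONE forward pass of the big model over the context plus all
    #    k drafts, returning its next-token distribution at each of the
    #    k + 1 final positions in parallel.
    target_probs = target_model.forward_all(tokens + draft)

    # 3. Accept drafts left to right until the big model disagrees.
    accepted = []
    for i, tok in enumerate(draft):
        target_choice = int(np.argmax(target_probs[i]))
        if tok == target_choice:
            accepted.append(tok)             # a free token: no extra big pass
        else:
            accepted.append(target_choice)   # the big model's own pick instead
            return accepted                  # discard the rest of the draft
    # Every draft survived: the same pass also yields one bonus token.
    accepted.append(int(np.argmax(target_probs[k])))
    return accepted
```

Notice that even a fully rejected draft still yields one token, because the target model's own choice at the first disagreement comes out of the verification pass. A round can break even with ordinary decoding but never do worse, apart from the draft model's overhead.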

The speed improvement depends entirely on the acceptance rate. If Haiku guesses correctly 90 percent of the time, you get close to 4x speedup. If it guesses correctly 50 percent of the time, the speedup drops to maybe 1.5x. The draft model does not need to be perfect. It just needs to be good enough that accepting tokens is more common than rejecting them.
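You can sanity-check those numbers with a back-of-the-envelope model along the lines of the original speculative decoding analysis: assume each draft token is accepted independently with probability alpha and that the draft model's own cost is negligible (both simplifications). Drafting k tokens per round then yields (1 - alpha^(k+1)) / (1 - alpha) tokens per target-model pass on average, counting the bonus token.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per big-model forward pass, assuming each of
    the k draft tokens is accepted independently with probability alpha and
    the draft model itself is free (an optimistic simplification)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_target_pass(0.9, 4))   # ~4.1 -> roughly 4x
print(expected_tokens_per_target_pass(0.5, 4))   # ~1.9 -> closer to 1.5-2x in practice
```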

Why This Works Without Cheating

The first question everyone asks is whether speculative decoding sacrifices quality. The answer is no, and the reason is mathematically proven. The verification step uses the target model’s probability distribution. When a token is accepted, it is accepted according to the target model’s preferences, not the draft model’s. The final output is identical to what the target model would have produced on its own.
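Concretely, when sampling (rather than decoding greedily), the guarantee comes from a rejection-sampling rule: accept a draft token x with probability min(1, p(x) / q(x)), where p is the target model's distribution and q is the draft model's, and on rejection draw a replacement from the leftover probability mass. The sketch below shows that rule for a single position, with the two distributions passed in as plain arrays.

```python
import numpy as np

def verify_token(rng, token, p, q):
    """Accept or replace one draft token so the result is distributed exactly
    according to the target distribution p.

    token : draft token id, previously sampled from q
    p, q  : target and draft next-token probability vectors
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True                    # accepted as-is
    # Rejected: resample from the normalised residual max(p - q, 0), which is
    # exactly the probability mass the draft model under-covered.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Here rng is a NumPy generator such as np.random.default_rng(). Summed over both branches, the probability of ending up with any given token is exactly p, which is why the output distribution matches the target model's even though the draft model proposed the candidate.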

Think of it like proofreading. If a human quickly edits a document and another human reviews every edit carefully, the final document reflects the careful reviewer’s standards, not the quick editor’s. The quick editor just accelerated the process.

This guarantee is why speculative decoding has become so attractive for production systems. There is no accuracy-speed tradeoff. It is just speed.

Real World Impact

The practical effects are staggering. Google researchers reported that speculative cascades, a hybrid that combines speculative decoding with model cascades (where a small model is allowed to answer easy queries outright), achieve speedups of 1.5 to 3 times on real workloads. On summarisation, translation, reasoning, and coding tasks, speculative cascades consistently beat the baselines.

At Anthropic and other AI labs, this technique has become a standard inference optimization. Serving engines like vLLM, SGLang, and Ollama have integrated speculative decoding. vLLM alone has over 40,000 stars on GitHub, and much of its appeal comes from intelligent batching and speculative decoding optimisations.

The real test is end-to-end latency. When you query an LLM through an API, speculative decoding directly increases tokens per second during decoding, which shortens the total time you wait for a response. For a user typing a message and waiting for a reply, faster inference feels like a more responsive system. For a business running inference at scale, faster means cheaper. Less GPU time. Lower costs.

The Draft Model Problem

The only real constraint is choosing a good draft model. It needs to be fast enough that the overhead of running two models does not erase the speedup, and accurate enough at predicting the next few tokens that its drafts usually survive verification. Running a large model as the draft defeats the purpose. You want a much smaller sibling: Llama 2 7B drafting for Llama 2 70B, or Haiku drafting for Sonnet.

Some systems use n-gram based drafting, which works like autocomplete. You scan the prompt and the text generated so far for the most recent place the trailing few tokens already appeared, and you propose whatever tokens followed them there. This is model-free and works surprisingly well for repetitive or structured text like code or documentation; a toy version is sketched below.
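Here is that idea in a few lines (sometimes called prompt lookup decoding): match the sequence's trailing n-gram against earlier occurrences and propose whatever followed the most recent match. This is a simplified illustration, not a production implementation.

```python
def ngram_draft(tokens, ngram_size=3, k=4):
    """Propose up to k draft tokens by matching the trailing n-gram against
    earlier occurrences in the same sequence (prompt plus generated text)."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards for the most recent earlier match of the tail.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size : start + ngram_size + k]
    return []                                  # no match: draft nothing

# Repetitive structure (common in code and docs) gives useful drafts.
seq = [5, 7, 9, 2, 4, 8, 5, 7, 9]
print(ngram_draft(seq))                        # -> [2, 4, 8, 5]
```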

Others use auxiliary heads trained on the target model, which are specialized small networks that predict draft tokens. Some use older versions of the same model. The options are flexible because speculative decoding does not care how the drafts are generated. It only cares about verification.

The Bigger Picture

Speculative decoding is part of a larger trend in AI infrastructure. The industry is optimising for production deployment. Research models proved that LLMs work. Production systems now prove that they can work efficiently at scale.

This is why so many papers on LLM inference dropped in 2025. Researchers at UC Berkeley, MIT, Google, Anthropic, and across academia and industry are racing to optimize every stage of inference. Paged attention. KV cache optimization. Quantization. Batching strategies. Speculative decoding. Every millisecond saved matters.

For a developer running models locally or serving them through an API, this matters in two ways. First, your queries will feel faster. Response times will improve visibly. Second, the cost of inference will drop. Faster inference on fewer GPUs means cheaper APIs and cheaper local deployment.

The real implication is that LLMs are becoming infrastructure, like databases or web servers. They are being optimized, deployed, and scaled with the same rigor. Speculative decoding is just one weapon in that arsenal.

What This Means for You

If you are building applications with LLMs, you should know this technique exists. When you evaluate an inference service or choose between vLLM, TensorRT-LLM, or other serving engines, some will use speculative decoding and some will not. The ones that do will be faster. When you run open-source models locally through Ollama or llama.cpp, these tools increasingly support speculative decoding out of the box.
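As an example of what "supported out of the box" looks like, here is a sketch of enabling a draft model in vLLM. The keyword arguments follow the interface documented in older vLLM releases (roughly the 0.5 to 0.6 series); newer releases have moved these settings into a dedicated speculative config option, so treat this as an illustration and check your version's documentation.

```python
from vllm import LLM, SamplingParams

# Sketch only: parameter names follow older vLLM releases and may differ
# in newer ones (where they live under a speculative config option).
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",               # target model
    speculative_model="meta-llama/Llama-2-7b-chat-hf",    # small draft model
    num_speculative_tokens=4,                             # k drafts per round
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```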

The technique is not something you need to implement yourself. It is becoming a standard part of the inference optimisation toolkit. But understanding how it works explains why latency is dropping across the board and why the performance gap between different inference systems keeps widening.

The next time you use an LLM and notice the response feels snappier than it did six months ago, speculative decoding is probably doing work in the background. A small draft model is speculating ahead. The big model is verifying in parallel. And you are getting the speedup for free.

