Serving LLMs can be faster than you think!

Machine learning and Data Science | Open Source contributor | Python Developer

Most of us have used Ollama, LM Studio or GPT4All to host models locally or for production workloads, but these platforms have been quietly shipping a feature that most of us never come across. With a few minor tweaks it can make token generation up to about 2x faster, and it is called speculative decoding. In production systems that use vLLM the idea has been pushed far enough that we are already seeing 1.69x speedups on fast versions of models. In conversations with engineers and leadership teams running their own inference servers, I keep hearing the same frustrations:

  1. I have invested in better hardware and the model scores well on benchmarks, but throughput is still low.

  2. I can't afford to go bigger, swap out my H100 setup and add more FLOPS.

  3. Should I look at llama.cpp to speed the model up?

  4. Will quantization work for me? I don't want to compromise on response quality.

Most of them don't know that speculative decoding is already in their stack and requires no hardware change or upgrade. The best part is that the process is lossless: the output matches what the target model would have produced on its own, while speed and throughput improve on the existing infra. Anyone running Ollama, vLLM, LM Studio or llama.cpp most likely already has it. Let's understand speculative decoding in this blog and look at code samples that help us bring it to our existing setup.

Why speculative decoding

Before jumping into speculative decoding, let's understand the standard way models generate the next token.

Sequential Decoding

Whenever an LLM generates text, it loads your prompt, instructions and the whole context into the billion-parameter model and runs all of those tokens through one forward pass just to produce a single token. That token is then appended to the previous tokens, and the process repeats for the next one. You may already see the underlying problem. What hurts more is that sequential decoding doesn't even keep the GPU busy: only around 10-30% of GPU compute is used while the rest of the cores sit idle. GPUs are built for massive parallelism, and an H100 can perform trillions of floating-point operations per second. The model cannot generate 3-4 tokens in a single pass because each new token depends on every token before it, and the architecture is trained to output one token at a time, so this is a software limitation. It's like owning a whole factory with a hundred workers but packing only one box at a time, because the pulley (the forward pass) brings in only one box per round.
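To make the one-token-per-pass loop concrete, here is a minimal sketch. The `toy_model` function is a hypothetical stand-in for a real LLM (it just continues a memorised sentence), so only the shape of the loop should be taken from it, not the model itself.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: takes the full context, returns ONE next token.
Model = Callable[[List[str]], str]

SENTENCE = ["The", "capital", "of", "France", "is", "Paris", "and", "it", "serves"]


def toy_model(context: List[str]) -> str:
    """Toy stand-in for an LLM forward pass: continues a memorised sentence."""
    position = len(context)
    return SENTENCE[position] if position < len(SENTENCE) else "<eos>"


def sequential_decode(model: Model, prompt: List[str], max_new_tokens: int) -> Tuple[List[str], int]:
    """Generate tokens one at a time; every new token costs one forward pass."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_token = model(tokens)  # whole context in, a single token out
        forward_passes += 1
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the new token back in for the next pass
    return tokens, forward_passes


prompt = ["The", "capital", "of", "France", "is"]
tokens, passes = sequential_decode(toy_model, prompt, max_new_tokens=4)
print(" ".join(tokens))
print("forward passes:", passes)
```

Four new tokens cost four forward passes here, and that one-to-one ratio is exactly what speculative decoding tries to break.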

What if you didn't have to generate each token with the big model, and only needed to verify tokens quickly? Suppose a small model that is fast at writing drafts (tokens) does the generating, and the bigger model just verifies them. That's where speculative decoding comes into the picture.

Speculative Decoding

The core idea of speculative decoding is that instead of one model doing everything, two models share the same task. For example:

  1. Draft model - small, fast and cheap (e.g. Llama 3.1 8B)

  2. Target model - large, slow and accurate (e.g. Llama 3.1 70B)

The draft model generates K tokens ahead using sequential decoding, and the target model verifies all K tokens in a single forward pass. The accepted tokens are kept and appended to the context of both models. If any of the K tokens is rejected during verification, every token after it is dropped as well. Let's walk through a stepwise simulation.

Suppose your prompt is

"The capital of France is"

Step 1.

The draft model runs first and predicts the next K tokens. Suppose K is 5 here.

"Paris", "and", "it", "is", "beautiful"

Step 2.

The target model takes the whole prompt plus the 5 draft tokens generated by the smaller model, and verifies all of the draft tokens in one forward pass.

"The capital of France is [Paris] [and] [it] [is] [beautiful]"
                             ✅     ✅    ✅   ❌ 

Let's understand what just happened. The 8B draft model generated a grammatically correct continuation, but the target model accepted only "Paris", "and" and "it" and rejected "is", which forces us to drop "beautiful" as well. The target model, trained on more data, may prefer words like "serves", "has" or "remains" because it wants to add more information instead of just saying the city is beautiful.

Step 3.

Keep the three accepted tokens and replace the rejected one with the target model's own prediction. Our sequence becomes:

"The capital of France is Paris and it serves..."

Being the smarter model, the target wants to add extra information here, so it rejected "is beautiful" and produced "serves" instead, which opens the door to more information about the city.
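The three steps above can be sketched as a small loop. This is a toy illustration under loud assumptions: `target_model` and `draft_model` are hypothetical functions that return one greedy token per call, and the "single forward pass" of the target is simulated by calling it on each prefix and counting the whole check as one pass. A real engine would score all K positions in one batched pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[str]], str]  # hypothetical interface: context -> greedy next token

TARGET_TEXT = "The capital of France is Paris and it serves as the seat of government".split()


def target_model(context: List[str]) -> str:
    """Toy target model: continues a memorised sentence."""
    position = len(context)
    return TARGET_TEXT[position] if position < len(TARGET_TEXT) else "<eos>"


def draft_model(context: List[str]) -> str:
    """Toy draft model: agrees with the target except at two positions."""
    disagreements = {8: "is", 9: "beautiful"}
    return disagreements.get(len(context), target_model(context))


def speculative_decode(target: Model, draft: Model, prompt: List[str],
                       k: int, max_new_tokens: int) -> Tuple[List[str], int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < limit:
        # Step 1: the draft model proposes k tokens, one after another.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # Step 2: the target checks every draft position (counted as ONE pass).
        target_passes += 1
        accepted: List[str] = []
        for drafted in proposal:
            expected = target(tokens + accepted)
            if drafted != expected:
                # Step 3: replace the first mismatch with the target's token
                # and drop every draft token after it.
                accepted.append(expected)
                break
            accepted.append(drafted)
        tokens.extend(accepted)
    return tokens[:limit], target_passes


prompt = TARGET_TEXT[:5]  # "The capital of France is"
tokens, passes = speculative_decode(target_model, draft_model, prompt, k=5, max_new_tokens=9)
print(" ".join(tokens))
print("target passes:", passes)  # 3 passes for 9 new tokens
```

The output is identical to what the toy target would produce on its own, but it takes 3 target passes instead of 9. Only exact greedy matching is shown here; engines that sample with a non-zero temperature use a probabilistic acceptance rule that this sketch does not cover.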

Let's go through what happened once again. In a single forward pass, a transformer outputs a probability distribution over its vocabulary at every position of the input, but sequential decoding uses only the last one to pick the next token (the most probable one, under greedy decoding) and throws the rest away. Speculative decoding puts those other positions to work: a smaller model proposes K tokens, the target model checks each of them in the same pass, and we append the accepted tokens while discarding the rejected ones. The process is lossless because we only keep tokens that the target model would have predicted itself. Even if only 1 or 2 of the K tokens are accepted and the rest are discarded, we still get about a 1.5x boost on average.
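For a rough feel of where the gain comes from, here is a back-of-the-envelope sketch. It assumes every draft token is accepted independently with the same probability `alpha`, and that one draft step costs a fixed fraction `draft_cost` of one target pass. Both are simplifying assumptions introduced for illustration, not measurements from any real system.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    The i-th token of a round (0-based) is emitted, either as an accepted
    draft or as the target's replacement, only if the i drafts before it
    were all accepted, which happens with probability alpha**i.
    """
    return sum(alpha ** i for i in range(k))


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup over sequential decoding.

    Each round costs k draft steps plus one target pass, measured in units
    of one target pass. draft_cost is an assumed ratio, not a measured one.
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


for alpha in (0.3, 0.6, 0.8):
    tokens = expected_tokens_per_pass(alpha, k=5)
    speedup = estimated_speedup(alpha, k=5, draft_cost=0.1)
    print(f"alpha={alpha}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x speedup")
```

The takeaway is that the acceptance rate and the relative cost of the draft model matter together: a draft model that is cheap but rarely accepted, or accurate but slow, eats into the gain. Implementations that also emit one extra target token when all K drafts are accepted would add `alpha**k` to the first function.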

Enable speculative decoding

Let's see how we can enable speculative decoding on different platforms.

  1. Ollama
# Pull your target and draft models
ollama pull llama3.1:70b
ollama pull llama3.1:8b

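# Run with speculative decoding (flag as given here; confirm it with `ollama run --help`,
# since a draft-model option may not exist in your Ollama version)
ollama run llama3.1:70b --draft llama3.1:8b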
  2. vLLM
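from vllm import LLM, SamplingParams

# Target model plus a smaller draft model that proposes tokens for it to verify.
# Note: these keyword arguments match older vLLM releases; newer releases take a
# speculative_config dict instead, so check the docs for your installed version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # K: draft tokens proposed per verification pass
)

# temperature=0.0 selects greedy decoding
sampling_params = SamplingParams(temperature=0.0)
outputs = llm.generate("The capital of France is", sampling_params)
print(outputs[0].outputs[0].text)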

The samples above show how you can enable speculative decoding on your existing infra and get serious gains in throughput. vLLM takes this to the next level with its EAGLE and P-EAGLE families of models, which achieve a higher acceptance rate as draft models, while P-EAGLE also generates draft tokens in parallel. We will keep that topic for the next blog, as I keep talking about vLLM's production capabilities and many more amazing features.

Conclusion

We looked at the autoregressive behaviour of models, where every next token depends on the tokens before it while most of the GPU sits waiting. If you are trying to cut API costs by hosting your own inference server, running a 70B model or smaller to automate your understanding tasks, you can enable speculative decoding. In the next blog we will look at vLLM and the EAGLE and P-EAGLE families of models.