The Complete Inference Pipeline: Putting It All Together

You've learned about KV cache, the model runner, tensor flow, CUDA graphs, and scheduling. Now let's trace a complete inference request from start to finish, seeing how all these pieces work together.

The Request Arrives

A user sends a prompt:

prompt = "What is the capital of France?"
sampling_params = SamplingParams(max_tokens=50, temperature=0.8)

Step 1: Request Enters the Engine

The LLMEngine receives the request:

def add_request(self, prompt: str, sampling_params: SamplingParams):
    # Tokenize the prompt
    token_ids = self.tokenizer.encode(prompt)
    # Result: [1234, 5678, 9012, 3456, 7890]  # 5 tokens
    
    # Create a Sequence object
    seq = Sequence(token_ids, sampling_params)
    # seq.token_ids = [1234, 5678, 9012, 3456, 7890]
    # seq.block_table = []  (empty, will be filled by block manager)
    # seq.num_cached_tokens = 0
    # seq.status = SequenceStatus.WAITING
    
    # Add to scheduler
    self.scheduler.add(seq)

The sequence is now in the waiting queue, ready to be processed.
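The Sequence object can be sketched as a small dataclass. This is a minimal, illustrative version — the field and enum names follow this walkthrough, but the real engine's class carries more state:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class Sequence:
    token_ids: list          # prompt tokens; generated tokens are appended
    max_tokens: int          # from SamplingParams
    block_table: list = field(default_factory=list)  # physical KV-cache blocks
    num_cached_tokens: int = 0
    status: SequenceStatus = SequenceStatus.WAITING

    def __len__(self):
        return len(self.token_ids)

    def append_token(self, token_id: int):
        self.token_ids.append(token_id)

seq = Sequence(token_ids=[1234, 5678, 9012, 3456, 7890], max_tokens=50)
print(len(seq), seq.status.name)  # 5 WAITING
```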

Step 2: Scheduler Decides What to Do

The LLMEngine calls step(), which starts with scheduling:

def step(self):
    seqs, is_prefill = self.scheduler.schedule()

The scheduler checks:

def schedule(self):
    # Phase 1: Prefill new sequences
    seq = self.waiting[0]  # Our sequence
    
    # Calculate blocks needed: 5 tokens, block_size=64
    num_blocks_needed = (5 + 64 - 1) // 64  # = 1
    
    # Allocate block 0
    seq.block_table = [0]
    seq.status = SequenceStatus.RUNNING
    
    scheduled_seqs = [seq]
    is_prefill = True
    
    return scheduled_seqs, True
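The block math in schedule() is plain ceiling division; a standalone sketch with the same block_size of 64 (the helper name is made up for illustration):

```python
def blocks_needed(num_tokens: int, block_size: int = 64) -> int:
    # Ceiling division: any partially filled block still occupies a full block.
    return (num_tokens + block_size - 1) // block_size

print(blocks_needed(5))    # 1: our 5-token prompt fits in one block
print(blocks_needed(64))   # 1: exactly full
print(blocks_needed(65))   # 2: one token spills into a second block
```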

Step 3: ModelRunner Prepares the Batch

The LLMEngine calls the model runner:

token_ids = self.model_runner.call("run", seqs, is_prefill=True)

ModelRunner's run() method is invoked:

def run(self, seqs, is_prefill):
    # Prepare the batch
    if is_prefill:
        input_ids, positions = self.prepare_prefill(seqs)

prepare_prefill() extracts the tokens and positions:

def prepare_prefill(self, seqs):
    input_ids = []
    positions = []
    cu_seqlens_q = [0]
    cu_seqlens_k = [0]
    slot_mapping = []
    
    for seq in seqs:  # Just our sequence
        seqlen = len(seq)  # 5
        
        # Extract all tokens (none cached yet)
        input_ids.extend(seq.token_ids)  # [1234, 5678, 9012, 3456, 7890]
        
        # Generate positions
        positions.extend(list(range(0, seqlen)))  # [0, 1, 2, 3, 4]
        
        # Cumulative lengths for packed batching
        cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen)  # [0, 5]
        cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen)  # [0, 5]
        
        # Map to KV cache slots
        for token_idx in range(seqlen):
            block_idx = token_idx // self.block_size  # All in block 0
            offset = token_idx % self.block_size      # 0, 1, 2, 3, 4
            physical_block = seq.block_table[block_idx]  # 0
            slot = physical_block * self.block_size + offset
            slot_mapping.append(slot)  # [0, 1, 2, 3, 4]
    
    return input_ids, positions, cu_seqlens_q, cu_seqlens_k, slot_mapping

Result:

input_ids = tensor([1234, 5678, 9012, 3456, 7890])
positions = tensor([0, 1, 2, 3, 4])
slot_mapping = tensor([0, 1, 2, 3, 4])
cu_seqlens_q = tensor([0, 5])
cu_seqlens_k = tensor([0, 5])
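To see how this packed (ragged) layout generalizes, here is a self-contained sketch of the same bookkeeping for two sequences. The helper name pack_prefill is hypothetical; the logic mirrors prepare_prefill() above:

```python
def pack_prefill(seqs, block_size=64):
    """seqs: list of (token_ids, block_table) pairs."""
    input_ids, positions, slot_mapping = [], [], []
    cu_seqlens = [0]
    for token_ids, block_table in seqs:
        seqlen = len(token_ids)
        input_ids.extend(token_ids)                  # concatenate all tokens
        positions.extend(range(seqlen))              # positions restart per sequence
        cu_seqlens.append(cu_seqlens[-1] + seqlen)   # boundaries of each sequence
        for i in range(seqlen):
            block = block_table[i // block_size]     # logical -> physical block
            slot_mapping.append(block * block_size + i % block_size)
    return input_ids, positions, cu_seqlens, slot_mapping

# Our 5-token sequence in block 0, plus a second 3-token sequence in block 1
ids, pos, cu, slots = pack_prefill([
    ([1234, 5678, 9012, 3456, 7890], [0]),
    ([11, 22, 33], [1]),
])
print(cu)     # [0, 5, 8]
print(slots)  # [0, 1, 2, 3, 4, 64, 65, 66]
```

The cumulative-lengths tensor is what lets the attention kernel treat one flat token array as several independent sequences.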

Step 4: Run the Model

logits = self.run_model(input_ids, positions, is_prefill=True)

Since this is prefill, eager execution is used:

def run_model(self, input_ids, positions, is_prefill):
    if is_prefill:
        # Eager execution
        hidden_states = self.model(input_ids, positions)
        logits = self.model.compute_logits(hidden_states)
        return logits

The model processes the 5 tokens:

Input: [1234, 5678, 9012, 3456, 7890]
    ↓
Embedding: [5, 2048]
    ↓
Layer 1: Attention + MLP: [5, 2048]
    ↓
Layer 2: Attention + MLP: [5, 2048]
    ↓
... (24 layers total)
    ↓
Final norm: [5, 2048]
    ↓
compute_logits: [1, vocab_size]  (only last token)

As the layers run, the attention kernel also writes each token's K and V vectors into cache slots 0-4, as directed by slot_mapping.

Result:

logits = tensor([[0.2, -0.1, 0.8, ...]])  # shape: [1, 152064]

Step 5: Sample the Next Token

temperatures = self.prepare_sample(seqs)  # [0.8]
token_ids = self.sampler(logits, temperatures).tolist()
# Result: [2345]  # "Paris"
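Temperature sampling divides the logits by the temperature before the softmax, then draws from the resulting distribution. A minimal pure-Python sketch (the real sampler operates on GPU tensors):

```python
import math
import random

def sample(logits, temperature, rng=random.random):
    if temperature == 0:                    # greedy: just take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng(), 0.0                     # inverse-CDF draw
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(logits) - 1

logits = [0.2, -0.1, 0.8]
print(sample(logits, temperature=0))  # 2: greedy picks the largest logit
```

Lower temperatures sharpen the distribution toward the argmax; higher temperatures flatten it, making unlikely tokens more probable.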

Step 6: Update the Sequence

Back in the scheduler:

def postprocess(self, seqs, token_ids):
    for seq, token_id in zip(seqs, token_ids):
        seq.append_token(token_id)
        seq.num_cached_tokens = len(seq)  # 6
        
        # Check if finished
        if token_id == self.eos or seq.num_completion_tokens >= seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
        # Not finished yet, stays in running queue

The sequence now has:

seq.token_ids = [1234, 5678, 9012, 3456, 7890, 2345]
seq.num_cached_tokens = 6
seq.status = SequenceStatus.RUNNING

Step 7: Next Iteration - Decode Phase

The next call to step() schedules again:

seqs, is_prefill = self.scheduler.schedule()

Now the sequence is in the running queue, so it gets scheduled for decode:

# Phase 2: Decode from running sequences
for seq in self.running:
    scheduled_seqs.append(seq)

is_prefill = False
return scheduled_seqs, False

ModelRunner prepares for decode:

def prepare_decode(self, seqs):
    input_ids = []
    positions = []
    context_lens = []
    slot_mapping = []
    
    for seq in seqs:  # Our sequence
        # Only the last token
        input_ids.append(seq.last_token)  # 2345
        
        # Position in sequence
        positions.append(len(seq) - 1)  # 5
        
        # Attend to all previous tokens
        context_lens.append(len(seq))  # 6
        
        # KV slot for new token
        slot = seq.block_table[-1] * self.block_size + seq.last_block_num_tokens - 1
        # slot = 0 * 64 + 5 = 5
        slot_mapping.append(slot)
    
    return input_ids, positions, context_lens, slot_mapping

Result:

input_ids = tensor([2345])
positions = tensor([5])
context_lens = tensor([6])
slot_mapping = tensor([5])
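The decode-time slot arithmetic can be checked in isolation. In this sketch last_block_num_tokens is derived from the sequence length and block table; the helper name is hypothetical:

```python
def decode_slot(seq_len, block_table, block_size=64):
    # The new token is token index seq_len - 1. Count how many tokens occupy
    # the final block, then convert to a flat slot in the KV cache.
    last_block_num_tokens = seq_len - (len(block_table) - 1) * block_size
    return block_table[-1] * block_size + last_block_num_tokens - 1

print(decode_slot(6, [0]))      # 5: our 6th token lands at offset 5 of block 0
print(decode_slot(65, [0, 3]))  # 192: first slot of physical block 3
```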

Step 8: Run Decode with CUDA Graph

logits = self.run_model(input_ids, positions, is_prefill=False)

Since this is decode and batch size is 1, CUDA graph is used:

def run_model(self, input_ids, positions, is_prefill):
    if not is_prefill and not self.enforce_eager and input_ids.size(0) <= 512:
        # Small decode batch: replay a pre-recorded CUDA graph
        return self.run_cudagraph(input_ids, positions)
    # Otherwise (prefill, or an oversized batch) fall back to eager execution
    hidden_states = self.model(input_ids, positions)
    return self.model.compute_logits(hidden_states)

def run_cudagraph(self, input_ids, positions):
    bs = input_ids.size(0)  # 1
    
    # Find the right graph
    graph_bs = next(x for x in self.graph_bs if x >= bs)  # 1
    graph = self.graphs[graph_bs]
    
    # Update static tensors
    self.graph_vars['input_ids'][:bs] = input_ids
    self.graph_vars['positions'][:bs] = positions
    # (slot_mapping, context_lens, block_tables come from prepare_decode)
    self.graph_vars['slot_mapping'][:bs] = slot_mapping
    self.graph_vars['context_lens'][:bs] = context_lens
    self.graph_vars['block_tables'][:bs] = block_tables
    
    # Replay the graph (ultra-fast)
    graph.replay()
    
    # Results are in graph_vars['outputs']
    return self.model.compute_logits(self.graph_vars['outputs'][:bs])

The GPU replays the pre-recorded kernel sequence in one shot, with no per-kernel launch overhead.

Result:

logits = tensor([[0.5, 0.2, -0.3, ...]])  # shape: [1, 152064]
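CUDA graphs are captured for a fixed set of batch sizes, so a live batch runs in the smallest captured graph that can hold it (padded up if needed). The bucket lookup in isolation — this bucket list is illustrative:

```python
# Illustrative bucket list: powers of two up to the 512 cap mentioned above.
graph_bs = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

def pick_graph(bs):
    # Smallest captured batch size that can hold the live batch.
    return next(x for x in graph_bs if x >= bs)

print(pick_graph(1))   # 1: our single-sequence decode uses the size-1 graph
print(pick_graph(3))   # 4: a batch of 3 is padded into the size-4 graph
```

Padding wastes a little compute on the unused rows, but keeping the shapes static is what makes graph replay possible at all.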

Step 9: Sample and Update

token_ids = self.sampler(logits, temperatures).tolist()
# Result: [5678]  # "is"

seq.append_token(5678)
seq.num_cached_tokens = 7

Steps 10-50: Continue Decode

This repeats 48 more times (max_tokens=50, already generated 2):

Token 3:  "the"
Token 4:  "capital"
Token 5:  "of"
...
Token 50: "."  (EOS)

Each decode step:

  1. Takes the last token
  2. Reads KV cache from previous tokens
  3. Computes attention
  4. Writes new K, V to cache
  5. Samples next token

All using the pre-recorded CUDA graph for speed.

Step 51: Sequence Finishes

if token_id == self.eos or seq.num_completion_tokens >= seq.max_tokens:
    seq.status = SequenceStatus.FINISHED
    self.block_manager.deallocate(seq)  # Free block 0

The sequence is complete:

seq.token_ids = [1234, 5678, 9012, 3456, 7890, 2345, 5678, ...]  # 5 prompt + 50 generated tokens
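Deallocation returns the sequence's blocks to a free list so the next request can reuse them. A minimal block-manager sketch under that free-list assumption:

```python
from collections import deque

class BlockManager:
    def __init__(self, num_blocks):
        self.free_blocks = deque(range(num_blocks))  # every block starts free

    def allocate(self, seq_block_table, num_blocks):
        # Hand out the lowest-numbered free blocks to the sequence.
        for _ in range(num_blocks):
            seq_block_table.append(self.free_blocks.popleft())

    def deallocate(self, seq_block_table):
        # Return the blocks for reuse, then clear the sequence's table.
        self.free_blocks.extend(seq_block_table)
        seq_block_table.clear()

mgr = BlockManager(num_blocks=4)
table = []
mgr.allocate(table, 1)
print(table)                 # [0]: our sequence got physical block 0
mgr.deallocate(table)
print(len(mgr.free_blocks))  # 4: block 0 is available again
```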

Step 52: Return to User

def generate(self, prompt, sampling_params):
    prompt_tokens = self.tokenizer.encode(prompt)
    self.add_request(prompt, sampling_params)
    
    while not self.is_finished():
        output, num_tokens = self.step()
    
    # Decode only the generated tokens (skip the prompt) back to text
    text = self.tokenizer.decode(seq.token_ids[len(prompt_tokens):])
    # Result: "Paris is the capital of France."
    
    return text

The Complete Timeline

T=0ms:   Request arrives
T=1ms:   Prefill 5 tokens (embedding + 24 layers)
T=150ms: Sample token 1
T=151ms: Decode token 1 (CUDA graph)
T=152ms: Sample token 2
T=153ms: Decode token 2 (CUDA graph)
...
T=200ms: Decode token 49 (CUDA graph)
T=201ms: Sample token 50 (EOS)
T=202ms: Return "Paris is the capital of France."

Total time: ~200ms for a 50-token response.
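The timeline reduces to simple arithmetic: one up-front prefill cost plus a per-token decode cost. A sketch using the illustrative numbers above:

```python
# Rough latency model for the walkthrough above (illustrative numbers).
prefill_ms = 150          # one-time cost to process the whole prompt
decode_ms_per_token = 1   # per generated token, via CUDA graph replay
num_generated = 50

total = prefill_ms + num_generated * decode_ms_per_token
print(total)  # 200: matches the ~200ms end-to-end figure
```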

Key Insights

  1. Prefill is expensive (~150ms for 5 tokens)
    • All tokens processed together
    • All layers computed
    • KV cache written
  2. Decode is fast (~1ms per token)
    • One token at a time
    • CUDA graph eliminates launch overhead
    • KV cache reused; only one new K, V pair is written per step
  3. Memory is managed carefully
    • Block 0 allocated for the sequence
    • KV values written to specific slots
    • Block deallocated when sequence finishes
  4. Everything is optimized
    • Paged attention for memory efficiency
    • CUDA graphs for speed
    • Continuous batching for GPU utilization
    • Packed-ragged batching for prefill

This is why modern inference engines can serve hundreds of concurrent requests efficiently. Every component—from memory layout to scheduling to kernel optimization—works together to maximize throughput while minimizing latency.