The Complete Inference Pipeline: Putting It All Together
You've learned about the KV cache, the model runner, how tensors flow through the model, CUDA graphs, and scheduling. Now let's trace a complete inference request from start to finish and see how all of these pieces work together.
The Request Arrives
A user sends a prompt:
prompt = "What is the capital of France?"
sampling_params = SamplingParams(max_tokens=50, temperature=0.8)
Step 1: Request Enters the Engine
The LLMEngine receives the request:
def add_request(self, prompt: str, sampling_params: SamplingParams):
    # Tokenize the prompt
    token_ids = self.tokenizer.encode(prompt)
    # Result: [1234, 5678, 9012, 3456, 7890] (5 tokens)
    # Create a Sequence object
    seq = Sequence(token_ids, sampling_params)
    # seq.token_ids = [1234, 5678, 9012, 3456, 7890]
    # seq.block_table = [] (empty, will be filled by the block manager)
    # seq.num_cached_tokens = 0
    # seq.status = SequenceStatus.WAITING
    # Add to the scheduler
    self.scheduler.add(seq)
The sequence is now in the waiting queue, ready to be processed.
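The Sequence object can be sketched as a small dataclass. Field and property names below mirror the ones used throughout this walkthrough, but the class itself is a simplified stand-in (for instance, sampling parameters are reduced to a bare max_tokens), not the engine's real implementation:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()


@dataclass
class Sequence:
    """Per-request state the engine tracks (simplified sketch)."""
    token_ids: list                 # prompt tokens; grows during decode
    max_tokens: int                 # would come from SamplingParams
    block_table: list = field(default_factory=list)  # physical KV block ids
    num_cached_tokens: int = 0
    status: SequenceStatus = SequenceStatus.WAITING

    def __post_init__(self):
        # Remember where the prompt ends so we can count completion tokens
        self.num_prompt_tokens = len(self.token_ids)

    def __len__(self):
        return len(self.token_ids)

    @property
    def last_token(self):
        return self.token_ids[-1]

    @property
    def num_completion_tokens(self):
        return len(self.token_ids) - self.num_prompt_tokens

    def append_token(self, token_id: int):
        self.token_ids.append(token_id)


seq = Sequence(token_ids=[1234, 5678, 9012, 3456, 7890], max_tokens=50)
print(len(seq), seq.status.name)  # 5 WAITING
```

Everything the scheduler and model runner do later is just reading and mutating this per-request state.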
Step 2: Scheduler Decides What to Do
The LLMEngine calls step(), which starts with scheduling:
def step(self):
    seqs, is_prefill = self.scheduler.schedule()
The scheduler checks:
- Are there new sequences waiting? Yes: our sequence.
- Is there room in the KV cache? Yes, plenty.
If both hold, it allocates KV cache blocks and schedules the sequence for prefill.
def schedule(self):
    # Phase 1: Prefill new sequences
    seq = self.waiting[0]  # Our sequence
    # Calculate blocks needed: 5 tokens with block_size=64
    num_blocks_needed = (5 + 64 - 1) // 64  # = 1
    # Allocate block 0
    seq.block_table = [0]
    seq.status = SequenceStatus.RUNNING
    scheduled_seqs = [seq]
    is_prefill = True
    return scheduled_seqs, is_prefill
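The block arithmetic above is a ceiling division over fixed-size blocks. A toy free-list allocator (a hypothetical BlockManager, for illustration only, not the engine's actual class) makes the bookkeeping concrete:

```python
class BlockManager:
    """Toy free-list allocator over fixed-size KV cache blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def _blocks_needed(self, num_tokens: int) -> int:
        # Ceiling division: 5 tokens with block_size=64 -> 1 block
        return (num_tokens + self.block_size - 1) // self.block_size

    def can_allocate(self, num_tokens: int) -> bool:
        return self._blocks_needed(num_tokens) <= len(self.free_blocks)

    def allocate(self, num_tokens: int) -> list:
        needed = self._blocks_needed(num_tokens)
        return [self.free_blocks.pop(0) for _ in range(needed)]

    def deallocate(self, blocks: list):
        self.free_blocks.extend(blocks)


mgr = BlockManager(num_blocks=1024, block_size=64)
print(mgr.allocate(5))    # [0]         5 tokens fit in one 64-slot block
print(mgr.allocate(130))  # [1, 2, 3]   ceil(130 / 64) = 3 blocks
```

When can_allocate returns False, the scheduler simply leaves the sequence in the waiting queue until another request finishes and frees blocks.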
Step 3: ModelRunner Prepares the Batch
The LLMEngine calls the model runner:
token_ids = self.model_runner.call("run", seqs, is_prefill=True)
ModelRunner's run() method is invoked:
def run(self, seqs, is_prefill):
    # Prepare the batch
    if is_prefill:
        input_ids, positions = self.prepare_prefill(seqs)
prepare_prefill() extracts the tokens and positions:
def prepare_prefill(self, seqs):
    input_ids = []
    positions = []
    cu_seqlens_q = [0]
    cu_seqlens_k = [0]
    slot_mapping = []
    for seq in seqs:  # Just our sequence
        seqlen = len(seq)  # 5
        # Extract all tokens (none cached yet)
        input_ids.extend(seq.token_ids)  # [1234, 5678, 9012, 3456, 7890]
        # Generate positions
        positions.extend(range(seqlen))  # [0, 1, 2, 3, 4]
        # Cumulative lengths for packed batching
        cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen)  # [0, 5]
        cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen)  # [0, 5]
        # Map each token to a KV cache slot
        for token_idx in range(seqlen):
            block_idx = token_idx // self.block_size     # All in block 0
            offset = token_idx % self.block_size         # 0, 1, 2, 3, 4
            physical_block = seq.block_table[block_idx]  # 0
            slot = physical_block * self.block_size + offset
            slot_mapping.append(slot)  # [0, 1, 2, 3, 4]
    # (cu_seqlens_* and slot_mapping are stashed as attention metadata
    #  for the kernel; simplified here)
    return input_ids, positions
Result:
input_ids = tensor([1234, 5678, 9012, 3456, 7890])
positions = tensor([0, 1, 2, 3, 4])
slot_mapping = tensor([0, 1, 2, 3, 4])
cu_seqlens_q = tensor([0, 5])
cu_seqlens_k = tensor([0, 5])
Step 4: Run the Model
logits = self.run_model(input_ids, positions, is_prefill=True)
Since this is prefill, eager execution is used:
def run_model(self, input_ids, positions, is_prefill):
    if is_prefill:
        # Eager execution
        hidden_states = self.model(input_ids, positions)
        logits = self.model.compute_logits(hidden_states)
        return logits
The model processes the 5 tokens:
Input: [1234, 5678, 9012, 3456, 7890]
↓
Embedding: [5, 2048]
↓
Layer 1: Attention + MLP: [5, 2048]
↓
Layer 2: Attention + MLP: [5, 2048]
↓
... (24 layers total)
↓
Final norm: [5, 2048]
↓
compute_logits: [1, vocab_size] (only last token)
The attention kernel writes KV values to the cache:
- Token 0 → slot 0
- Token 1 → slot 1
- Token 2 → slot 2
- Token 3 → slot 3
- Token 4 → slot 4
Result:
logits = tensor([[0.2, -0.1, 0.8, ...]]) # shape: [1, 152064]
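The slot-indexed cache writes can be mimicked with plain Python lists: the cache is one flat pool of slots, and slot_mapping tells the kernel where each token's key/value vector lands. This scatter sketch stands in for the real fused CUDA kernel:

```python
def store_kv(kv_cache, slot_mapping, keys, values):
    """Scatter each token's (key, value) pair into its assigned cache slot."""
    for i, slot in enumerate(slot_mapping):
        kv_cache[slot] = (keys[i], values[i])


num_slots = 2 * 64           # 2 blocks of 64 slots each
kv_cache = [None] * num_slots
keys = [f"k{i}" for i in range(5)]
values = [f"v{i}" for i in range(5)]
store_kv(kv_cache, [0, 1, 2, 3, 4], keys, values)
print(kv_cache[4])  # ('k4', 'v4')
```

Because the mapping is explicit, sequences can occupy non-contiguous blocks anywhere in the pool; this indirection is the core idea behind paged attention.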
Step 5: Sample the Next Token
temperatures = self.prepare_sample(seqs) # [0.8]
token_ids = self.sampler(logits, temperatures).tolist()
# Result: [2345] # "Paris"
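Temperature sampling itself is a small amount of math. Here is a self-contained sketch over a plain logits list (the engine's real sampler works on batched GPU tensors, but the logic is the same):

```python
import math
import random


def sample(logits, temperature):
    """Temperature sampling over one logits vector; greedy when temperature == 0."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    # Subtract the max before exponentiating for numerical stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]


print(sample([0.2, -0.1, 8.0], temperature=0))  # 2 (greedy picks the peak)
```

Lower temperatures sharpen the distribution toward the argmax; temperature=0.8, as in our request, keeps some randomness while still favoring high-probability tokens.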
Step 6: Update the Sequence
Back in the scheduler:
def postprocess(self, seqs, token_ids):
    for seq, token_id in zip(seqs, token_ids):
        seq.append_token(token_id)
        seq.num_cached_tokens = len(seq)  # 6
        # Check if finished
        if token_id == self.eos or seq.num_completion_tokens >= seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
        # Not finished yet, so the sequence stays in the running queue
The sequence now has:
seq.token_ids = [1234, 5678, 9012, 3456, 7890, 2345]
seq.num_cached_tokens = 6
seq.status = SequenceStatus.RUNNING
Step 7: Next Iteration - Decode Phase
The next call to step() schedules again:
seqs, is_prefill = self.scheduler.schedule()
Now the sequence is in the running queue, so it gets scheduled for decode:
# Phase 2: Decode from running sequences
for seq in self.running:
    scheduled_seqs.append(seq)
is_prefill = False
return scheduled_seqs, is_prefill
ModelRunner prepares for decode:
def prepare_decode(self, seqs):
    input_ids = []
    positions = []
    context_lens = []
    slot_mapping = []
    for seq in seqs:  # Our sequence
        # Only the last token
        input_ids.append(seq.last_token)  # 2345
        # Position in the sequence
        positions.append(len(seq) - 1)  # 5
        # Attend to all previous tokens
        context_lens.append(len(seq))  # 6
        # KV slot for the new token
        slot = seq.block_table[-1] * self.block_size + seq.last_block_num_tokens - 1
        # slot = 0 * 64 + 5 = 5
        slot_mapping.append(slot)
    # (context_lens and slot_mapping are stashed as attention metadata)
    return input_ids, positions
Result:
input_ids = tensor([2345])
positions = tensor([5])
context_lens = tensor([6])
slot_mapping = tensor([5])
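As with prefill, the decode metadata is simple arithmetic. The sketch below recomputes last_block_num_tokens from the sequence length instead of reading it off the Sequence object (an assumption made so the function is self-contained):

```python
def build_decode_metadata(seqs, block_size=64):
    """seqs: list of (token_ids, block_table) pairs; one new token per sequence."""
    input_ids, positions, context_lens, slot_mapping = [], [], [], []
    for token_ids, block_table in seqs:
        input_ids.append(token_ids[-1])       # only the last token is fed in
        positions.append(len(token_ids) - 1)  # its position in the sequence
        context_lens.append(len(token_ids))   # attend over the whole context
        # The newest token occupies the next slot in the last allocated block:
        # tokens already in earlier blocks = (num_blocks - 1) * block_size
        last_block_num_tokens = len(token_ids) - (len(block_table) - 1) * block_size
        slot_mapping.append(block_table[-1] * block_size + last_block_num_tokens - 1)
    return input_ids, positions, context_lens, slot_mapping


ids, pos, ctx, slots = build_decode_metadata(
    [([1234, 5678, 9012, 3456, 7890, 2345], [0])])
print(ids, pos, ctx, slots)  # [2345] [5] [6] [5]
```

Note how little work decode preparation is compared to prefill: one token, one position, one slot per sequence.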
Step 8: Run Decode with CUDA Graph
logits = self.run_model(input_ids, positions, is_prefill=False)
Since this is decode and batch size is 1, CUDA graph is used:
def run_model(self, input_ids, positions, is_prefill):
    if not is_prefill and not self.enforce_eager and input_ids.size(0) <= 512:
        # CUDA graph replay
        return self.run_cudagraph(input_ids, positions)
def run_cudagraph(self, input_ids, positions):
    bs = input_ids.size(0)  # 1
    # Find the smallest captured graph that fits this batch
    graph_bs = next(x for x in self.graph_bs if x >= bs)  # 1
    graph = self.graphs[graph_bs]
    # Copy this step's inputs into the graph's static tensors
    # (slot_mapping, context_lens, and block_tables come from the
    #  attention metadata prepared in prepare_decode)
    self.graph_vars['input_ids'][:bs] = input_ids
    self.graph_vars['positions'][:bs] = positions
    self.graph_vars['slot_mapping'][:bs] = slot_mapping
    self.graph_vars['context_lens'][:bs] = context_lens
    self.graph_vars['block_tables'][:bs] = block_tables
    # Replay the pre-recorded graph (a single launch, ultra-fast)
    graph.replay()
    # Results land in graph_vars['outputs']
    return self.model.compute_logits(self.graph_vars['outputs'][:bs])
The GPU executes the pre-recorded graph:
- Read token 2345 embedding
- Read KV cache from slots 0-4 (previous tokens)
- Compute attention with new token
- Write new K, V to slot 5
- Compute MLP
- Return logits for next token
Result:
logits = tensor([[0.5, 0.2, -0.3, ...]]) # shape: [1, 152064]
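One detail worth making concrete is why graphs are captured only for a fixed set of batch sizes: at runtime the engine replays the smallest captured graph that fits, and any unused rows are padding. The bucket-selection logic (mirroring the next(...) expression above) is:

```python
def pick_graph_bs(graph_bs_list, bs):
    """Return the smallest captured batch size that can hold bs rows.

    graph_bs_list must be sorted ascending, as it is at capture time.
    """
    return next(x for x in graph_bs_list if x >= bs)


captured = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(pick_graph_bs(captured, 1))  # 1
print(pick_graph_bs(captured, 5))  # 8 (rows 5..7 are padding)
```

Padding a few rows is far cheaper than re-recording a graph, which is why a small set of power-of-two buckets covers every decode batch up to the cap.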
Step 9: Sample and Update
token_ids = self.sampler(logits, temperatures).tolist()
# Result: [5678] # "is"
seq.append_token(5678)
seq.num_cached_tokens = 7
Steps 10-50: Continue Decode
This repeats 48 more times (max_tokens=50, and 2 tokens have already been generated):
Token 3: Decode → "the"
Token 4: Decode → "capital"
Token 5: Decode → "of"
...
Token 50: Decode → "." (EOS token)
Each decode step:
- Takes the last token
- Reads KV cache from previous tokens
- Computes attention
- Writes new K, V to cache
- Samples next token
All using the pre-recorded CUDA graph for speed.
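These repeating steps form the classic autoregressive loop. Here is its skeleton with a toy stand-in for the model forward pass (step_fn is a hypothetical callback, not an engine API):

```python
def decode_loop(prompt_ids, max_tokens, eos, step_fn):
    """Skeleton of autoregressive decoding: feed context, append sample, repeat."""
    token_ids = list(prompt_ids)
    for _ in range(max_tokens):
        next_id = step_fn(token_ids)  # stands in for model forward + sampling
        token_ids.append(next_id)
        if next_id == eos:            # stop early on end-of-sequence
            break
    return token_ids


# Toy "model": emit tokens 100, 101, 102, then EOS (= 103)
fake_step = lambda ids: min(ids[-1] + 1 if ids[-1] >= 100 else 100, 103)
out = decode_loop([1, 2, 3], max_tokens=50, eos=103, step_fn=fake_step)
print(out)  # [1, 2, 3, 100, 101, 102, 103]
```

The real engine runs this loop implicitly: each call to step() advances every running sequence by one iteration, and the EOS/max_tokens check lives in postprocess().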
Step 51: Sequence Finishes
if token_id == self.eos or seq.num_completion_tokens >= seq.max_tokens:
    seq.status = SequenceStatus.FINISHED
    self.block_manager.deallocate(seq)  # Free block 0
The sequence is complete:
seq.token_ids = [1234, 5678, 9012, 3456, 7890, 2345, 5678, ..., 7890]
Step 52: Return to User
def generate(self, prompt, sampling_params):
    self.add_request(prompt, sampling_params)
    while not self.is_finished():
        output, num_tokens = self.step()
    # Decode the generated tokens (everything after the prompt) to text
    text = self.tokenizer.decode(seq.token_ids[len(prompt_tokens):])
    # Result: "Paris is the capital of France."
    return text
The Complete Timeline
T=0ms: Request arrives
T=1ms: Prefill 5 tokens (embedding + 24 layers)
T=150ms: Sample token 1
T=151ms: Decode token 1 (CUDA graph)
T=152ms: Sample token 2
T=153ms: Decode token 2 (CUDA graph)
...
T=200ms: Decode token 49 (CUDA graph)
T=201ms: Sample token 50 (EOS)
T=202ms: Return "Paris is the capital of France."
Total time: ~200ms for a 50-token response.
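The timeline implies a simple first-order latency model: one prefill plus one decode step per generated token. Using the article's rough numbers (~150 ms prefill, ~1 ms per decode step):

```python
def estimate_latency_ms(prefill_ms, decode_ms_per_token, num_decode_tokens):
    """Rough first-order model: one prefill plus N sequential decode steps."""
    return prefill_ms + decode_ms_per_token * num_decode_tokens


total = estimate_latency_ms(150, 1, 50)
print(total)                 # 200 ms end-to-end
print(50 / (total / 1000))   # 250.0 tokens/sec for this single request
```

The model also shows why time-to-first-token and per-token latency are tuned separately: prefill cost dominates short responses, while decode cost dominates long ones.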
Key Insights
- Prefill is expensive (~150ms for 5 tokens)
  - All tokens are processed together
  - All layers are computed
  - The KV cache is written
- Decode is fast (~1ms per token)
  - One token at a time
  - CUDA graphs eliminate launch overhead
  - The KV cache is mostly read; only one new slot is written per step
- Memory is managed carefully
  - Block 0 was allocated for the sequence
  - KV values were written to specific slots
  - The block was deallocated when the sequence finished
- Everything is optimized
  - Paged attention for memory efficiency
  - CUDA graphs for speed
  - Continuous batching for GPU utilization
  - Packed (ragged) batching for prefill
This is why modern inference engines can serve hundreds of concurrent requests efficiently. Every component—from memory layout to scheduling to kernel optimization—works together to maximize throughput while minimizing latency.