The Model Runner: Orchestrating GPU Inference
The ModelRunner is the heart of an inference engine. It takes the sequences the scheduler selected and actually runs them on the GPU. But it does far more than just "run the model": it manages distributed execution, allocates memory, optimizes decode with CUDA graphs, and coordinates multiple GPU processes.
What ModelRunner Actually Does
ModelRunner has five major responsibilities:
- Initialize distributed GPU execution — Set up NCCL for multi-GPU communication
- Load the model — Either a full copy or a shard (for tensor parallelism)
- Allocate KV cache memory — Pre-allocate the GPU memory pool for KV cache
- Transform sequences into batches — Convert Sequence objects into tensors for the GPU
- Choose execution mode — Prefill (eager), decode (eager or CUDA graph)
The Rank 0 vs Worker Ranks Split
When you initialize ModelRunner with multiple GPUs, something interesting happens:
```python
if rank == 0:
    # Main process: create shared memory
    self.shm = SharedMemory(name="nanovllm", create=True, size=2**20)
    dist.barrier()
    # Returns from __init__; stays under the control of LLMEngine
else:
    # Worker process: connect and start the event loop
    dist.barrier()
    self.shm = SharedMemory(name="nanovllm")
    self.loop()  # <-- workers stay here in an infinite loop
```
Rank 0 is both:
- A real execution rank that does actual forward computation
- The control plane that coordinates all other ranks
Worker ranks (1, 2, 3...) enter an infinite event loop during initialization. They never return from __init__. Instead, they wait for commands from Rank 0 via shared memory.
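To make that event loop concrete, here is a runnable, single-process simulation. `Worker`, `inbox`, and `log` are hypothetical stand-ins for the real rank process and shared-memory buffer; the real loop blocks on shared memory rather than popping from a list:

```python
import pickle

class Worker:
    """Minimal stand-in for a worker rank: commands arrive as pickled tuples."""
    def __init__(self, inbox):
        self.inbox = inbox  # stands in for the shared-memory buffer
        self.log = []

    def read_shm(self):
        # Real workers block here until Rank 0 signals a new command
        return pickle.loads(self.inbox.pop(0))

    def run(self, seqs, is_prefill):
        self.log.append(("run", len(seqs), is_prefill))

    def exit(self):
        self.log.append(("exit",))

    def loop(self):
        while True:
            method_name, *args = self.read_shm()
            getattr(self, method_name)(*args)  # execute the same method locally
            if method_name == "exit":
                return

inbox = [pickle.dumps(("run", ["seq0", "seq1"], True)),
         pickle.dumps(("exit",))]
w = Worker(inbox)
w.loop()
# w.log records one "run" over two sequences, then the "exit" command
```

The real worker never returns until it sees the shutdown command, which is why `__init__` never completes on ranks 1+.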
How Commands Flow Through Shared Memory
When the LLMEngine calls a method on ModelRunner (which only happens on Rank 0), here's what happens:
```python
def call(self, method_name, *args):
    if self.world_size > 1 and self.rank == 0:
        self.write_shm(method_name, *args)  # broadcast the command to workers
    method = getattr(self, method_name, None)
    return method(*args)  # execute locally on Rank 0
```
Rank 0 writes the method name and arguments to shared memory, then triggers an event. Worker ranks wake up from their read_shm() call, deserialize the command, and execute the same method locally.
This pattern separates the control plane (Rank 0 deciding what to do) from the data plane (all ranks doing the heavy computation).
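A minimal sketch of such a shared-memory protocol: a length header followed by a pickled `(method_name, *args)` tuple. The 4-byte header and the exact layout are assumptions for illustration; the real `write_shm`/`read_shm` also use event signaling, which is elided here:

```python
import pickle

def write_shm(buf, method_name, *args):
    # Serialize the command and store it behind a 4-byte length header
    data = pickle.dumps((method_name, *args))
    buf[0:4] = len(data).to_bytes(4, "little")
    buf[4:4 + len(data)] = data

def read_shm(buf):
    # Read the length header, then unpickle exactly that many bytes
    n = int.from_bytes(bytes(buf[0:4]), "little")
    return pickle.loads(bytes(buf[4:4 + n]))

buf = bytearray(2**20)  # stands in for the SharedMemory buffer
write_shm(buf, "run", [1, 2, 3], True)
# read_shm(buf) recovers ("run", [1, 2, 3], True) on every rank
```

Because every rank sees the same bytes, every rank deterministically executes the same method with the same arguments.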
Prefill vs Decode: Two Different Paths
The run() method is the entry point:
```python
def run(self, seqs: list[Sequence], is_prefill: bool) -> list[int]:
    # Prepare batch data
    if is_prefill:
        input_ids, positions = self.prepare_prefill(seqs)
    else:
        input_ids, positions = self.prepare_decode(seqs)
    # Gather per-sequence sampling temperatures (rank 0 only)
    temperatures = self.prepare_sample(seqs) if self.rank == 0 else None
    # Execute the model
    logits = self.run_model(input_ids, positions, is_prefill)
    # Sample next tokens (rank 0 only)
    token_ids = self.sampler(logits, temperatures).tolist() if self.rank == 0 else None
    return token_ids
Prefill Preparation
During prefill, you have multiple sequences with varying lengths. The goal is to pack all uncached tokens into one batch:
```python
for seq in seqs:
    seqlen = len(seq)
    # Extract NEW tokens (with no prefix-cache hit, num_cached_tokens == 0, so all of them)
    input_ids.extend(seq[seq.num_cached_tokens:])
    # Generate REAL positions for those tokens
    positions.extend(range(seq.num_cached_tokens, seqlen))
    # Build cumulative lengths for packed-ragged batching:
    # queries cover only the uncached tokens, keys cover the whole sequence
    seqlen_q = seqlen - seq.num_cached_tokens
    cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)
    cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen)
    # Map each uncached block to its contiguous KV cache slots
    for i in range(seq.num_cached_blocks, seq.num_blocks):
        start = seq.block_table[i] * self.block_size
        if i != seq.num_blocks - 1:
            end = start + self.block_size
        else:
            end = start + seq.last_block_num_tokens
        slot_mapping.extend(range(start, end))
```
The key insight: prefill uses packed-ragged batching. Tokens from different sequences are concatenated, and cu_seqlens_q tells the attention kernel where each sequence's tokens begin and end.
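A standalone sketch of what the cumulative-lengths array looks like for two prompts of lengths 3 and 5 (assuming no prefix caching, so queries and keys coincide):

```python
# Two sequences packed into one batch of 8 tokens
seq_lens = [3, 5]
cu_seqlens_q = [0]
for n in seq_lens:
    cu_seqlens_q.append(cu_seqlens_q[-1] + n)
# cu_seqlens_q is now [0, 3, 8]:
#   sequence 0 owns packed tokens [0:3], sequence 1 owns [3:8]
```

The attention kernel slices the packed token tensor with these offsets, so no padding tokens are ever computed.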
Decode Preparation
Decode is simpler because each sequence contributes exactly one token:
```python
for seq in seqs:
    # Each sequence contributes exactly ONE token
    input_ids.append(seq.last_token)
    positions.append(len(seq) - 1)   # position of the last token
    context_lens.append(len(seq))    # attend to all previous tokens
    # Compute the KV cache slot for the NEW token
    slot = seq.block_table[-1] * self.block_size + seq.last_block_num_tokens - 1
    slot_mapping.append(slot)
```
Decode uses fixed one-token-per-sequence batching. This structure is what makes CUDA graph optimization possible.
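The slot arithmetic above is worth tracing once by hand. With assumed values (block size 256, a sequence owning physical blocks 7 and 12, and 5 tokens in its last block, counting the new one):

```python
block_size = 256
block_table = [7, 12]        # physical blocks owned by this sequence
last_block_num_tokens = 5    # tokens in the last block, including the new token
# The new token's KV cache slot: start of the last block, plus its offset within it
slot = block_table[-1] * block_size + last_block_num_tokens - 1
# slot == 12 * 256 + 4 == 3076
```

Each decode step appends exactly one key/value pair at this slot, so no copying or re-layout of earlier KV entries is ever needed.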
Preparing Block Tables for GPU
Block tables need to be formatted for GPU consumption. This happens in prepare_block_tables():
```python
max_len = max(len(seq.block_table) for seq in seqs)
block_tables = []
for seq in seqs:
    padded = seq.block_table + [-1] * (max_len - len(seq.block_table))
    block_tables.append(padded)
return torch.tensor(block_tables, dtype=torch.int32)
```
Why padding with -1? The GPU kernel checks for -1 and skips invalid blocks. This allows all sequences to have the same tensor shape, which GPUs love.
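A tiny, torch-free sketch of the padding step (the block table values are made up; the final `torch.tensor` conversion is omitted so the example stands alone):

```python
# Two sequences: one owns blocks [7, 12], the other only block [3]
raw_tables = [[7, 12], [3]]
max_len = max(len(bt) for bt in raw_tables)
# Pad every row to max_len with -1 so all rows share one rectangular shape
padded = [bt + [-1] * (max_len - len(bt)) for bt in raw_tables]
# padded == [[7, 12], [3, -1]]: the kernel treats -1 as "no block here"
```

The rectangular result is what lets a single 2-D tensor describe every sequence's block mapping at once.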
Running the Model: Eager vs CUDA Graph
The run_model() method chooses between two execution paths:
```python
def run_model(self, input_ids, positions, is_prefill):
    if is_prefill or self.enforce_eager or input_ids.size(0) > 512:
        # Eager execution: launch kernels one by one
        return self.model.compute_logits(self.model(input_ids, positions))
    else:
        # CUDA graph: replay a pre-recorded graph
        return self.run_cudagraph(input_ids, positions)
```
Prefill always uses eager execution because the batch structure changes every step (different sequences, different lengths).
Decode can use CUDA graphs because the batch structure is fixed (one token per sequence).
CUDA Graphs: Pre-Recording GPU Work
CUDA graphs are one of the most powerful optimizations in inference. Instead of launching kernels one by one (which has CPU overhead), you record the entire forward pass once, then replay it.
During initialization, capture_cudagraph() runs:
```python
for bs in reversed(self.graph_bs):  # e.g. [1, 2, 4, 8, 16, 32], largest first
    graph = torch.cuda.CUDAGraph()
    # Warmup run to stabilize memory allocations
    outputs[:bs] = self.model(input_ids[:bs], positions[:bs])
    # Record the forward pass into the graph, sharing one memory pool
    with torch.cuda.graph(graph, self.graph_pool):
        outputs[:bs] = self.model(input_ids[:bs], positions[:bs])
    # Save the graph for later replay
    self.graphs[bs] = graph
```
The key insight: CUDA graphs record tensor pointers, not values. So you can update the input tensors and replay the graph with new data:
```python
# At inference time:
bs = actual_batch_size
self.graph_vars["input_ids"][:bs] = new_tokens
self.graph_vars["positions"][:bs] = new_positions
self.graphs[bs].replay()  # replay the recorded kernels with the new inputs
```
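Since graphs are only captured at a handful of batch sizes, the runner must round a real batch up to the smallest captured size that fits. A sketch of that bucket lookup (the `graph_bs` values are assumed):

```python
import bisect

graph_bs = [1, 2, 4, 8, 16, 32]  # batch sizes captured at init (assumed buckets)

def pick_bucket(bs: int) -> int:
    # Smallest captured graph whose batch size is >= the actual batch size
    return graph_bs[bisect.bisect_left(graph_bs, bs)]

# pick_bucket(3) selects the bs=4 graph; pick_bucket(16) matches exactly
```

The unused tail slots in the padded batch are simply ignored when results are read back, which is a small price for skipping per-kernel launch overhead.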
This is a large part of why decode steps are so cheap on the CPU side: instead of launching dozens of kernels one by one each step, you replay a single pre-recorded graph.
The Complete Execution Flow
Here's how everything fits together:
```
LLMEngine.step()
    ↓
Scheduler.schedule() → returns (seqs, is_prefill)
    ↓
ModelRunner.call("run", seqs, is_prefill)
    ├─ Rank 0: write_shm() → broadcast to workers
    └─ All ranks: execute run()
         ├─ prepare_prefill() or prepare_decode()
         ├─ prepare_block_tables()
         ├─ run_model() → eager or CUDA graph
         └─ sampler() → next tokens (rank 0 only)
    ↓
Scheduler.postprocess() → update sequences
```
Why This Design?
The ModelRunner's complexity exists for good reasons:
- Multi-GPU efficiency: Shared memory gives Rank 0 a cheap broadcast channel to worker processes, while NCCL handles tensor communication
- Memory optimization: Pre-allocated KV cache, paged attention, block reuse
- Execution optimization: CUDA graphs eliminate launch overhead for decode
- Batching flexibility: Packed-ragged batching for prefill, fixed batching for decode
Understanding ModelRunner is understanding how modern inference engines achieve their speed. It's not magic—it's careful engineering at every level.