
The Model Runner: Orchestrating GPU Inference

The ModelRunner is the heart of an inference engine. It's responsible for taking sequences from the scheduler and actually running them on the GPU. But it's doing far more than just "running the model"—it's managing distributed execution, allocating memory, optimizing with CUDA graphs, and coordinating multiple GPU processes.

What ModelRunner Actually Does

ModelRunner has five major responsibilities:

  1. Initialize distributed GPU execution — Set up NCCL for multi-GPU communication
  2. Load the model — Either a full copy or a shard (for tensor parallelism)
  3. Allocate KV cache memory — Pre-allocate the GPU memory pool for KV cache
  4. Transform sequences into batches — Convert Sequence objects into tensors for the GPU
  5. Choose execution mode — Prefill (eager), decode (eager or CUDA graph)
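Responsibility 3 hides a back-of-envelope calculation worth making concrete: each KV cache block stores keys and values for every layer, so the number of blocks is just free memory divided by the per-block footprint. A minimal sketch (the function and parameter names are illustrative, not nanovllm's exact API):

```python
def num_kv_cache_blocks(free_bytes: int, num_layers: int, block_size: int,
                        num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """How many KV cache blocks fit in the remaining GPU memory.

    Each block stores keys AND values (factor 2) for every layer,
    holding block_size tokens of num_kv_heads * head_dim elements each.
    """
    bytes_per_block = 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes
    return free_bytes // bytes_per_block

# Example: 8 GiB free, 32 layers, 8 KV heads of dim 128,
# 16-token blocks, fp16 (2 bytes/element) -> 2 MiB per block, 4096 blocks
blocks = num_kv_cache_blocks(8 * 2**30, num_layers=32, block_size=16,
                             num_kv_heads=8, head_dim=128)
```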

The Rank 0 vs Worker Ranks Split

When you initialize ModelRunner with multiple GPUs, something interesting happens:

if rank == 0:
    # Main process: create shared memory
    self.shm = SharedMemory(name="nanovllm", create=True, size=2**20)
    dist.barrier()
    # Returns from __init__, continues to be controlled by LLMEngine
else:
    # Worker process: connect and start event loop
    dist.barrier()
    self.shm = SharedMemory(name="nanovllm")
    self.loop()  # <-- WORKERS STICK HERE IN AN INFINITE LOOP

Rank 0 is both a worker (it runs its shard of the model like every other rank) and the controller (it returns from __init__ and keeps taking orders from the LLMEngine).

Worker ranks (1, 2, 3...) enter an infinite event loop during initialization. They never return from __init__. Instead, they wait for commands from Rank 0 via shared memory.
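The worker side of this split boils down to a dispatch loop: read a serialized command, execute it by name, repeat until told to exit. A minimal, self-contained illustration of the pattern (a plain list stands in for the shared-memory channel, and the Worker class and its log are invented for the demo):

```python
import pickle

class Worker:
    """Sketch of the worker-rank event loop: pull serialized commands
    off a channel and dispatch them by method name."""

    def __init__(self, channel):
        self.channel = channel  # list of pickled (method_name, args) tuples
        self.log = []

    def run(self, seqs, is_prefill):
        # Stand-in for the real forward pass
        self.log.append(("run", len(seqs), is_prefill))

    def exit(self):
        self.log.append(("exit",))

    def loop(self):
        while True:
            method_name, args = pickle.loads(self.channel.pop(0))
            getattr(self, method_name)(*args)  # execute the same method locally
            if method_name == "exit":
                break

channel = [pickle.dumps(("run", (["seq0", "seq1"], True))),
           pickle.dumps(("exit", ()))]
worker = Worker(channel)
worker.loop()  # processes both commands, then returns
```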

How Commands Flow Through Shared Memory

When the LLMEngine calls a method on ModelRunner (which only happens on Rank 0), here's what happens:

def call(self, method_name, *args):
    if self.world_size > 1 and self.rank == 0:
        self.write_shm(method_name, *args)  # Broadcast to workers
    
    method = getattr(self, method_name, None)
    return method(*args)  # Execute locally on Rank 0

Rank 0 writes the method name and arguments to shared memory, then triggers an event. Worker ranks wake up from their read_shm() call, deserialize the command, and execute the same method locally.

This pattern separates the control plane (Rank 0 deciding what to do) from the data plane (all ranks doing the heavy computation).
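One plausible byte layout for that channel is a small length header followed by a pickled (method_name, args) tuple. Here is a sketch under that assumption; the real nanovllm layout may differ, and the synchronization (the event that wakes sleeping workers) is omitted:

```python
import pickle
from multiprocessing.shared_memory import SharedMemory

def write_shm(shm: SharedMemory, method_name: str, *args) -> None:
    # Serialize the command, then store "length || payload" at offset 0.
    payload = pickle.dumps((method_name, args))
    shm.buf[0:4] = len(payload).to_bytes(4, "little")
    shm.buf[4:4 + len(payload)] = payload

def read_shm(shm: SharedMemory):
    # Read the length header, then deserialize exactly that many bytes.
    n = int.from_bytes(shm.buf[0:4], "little")
    return pickle.loads(bytes(shm.buf[4:4 + n]))

shm = SharedMemory(create=True, size=2**20)
try:
    write_shm(shm, "run", ["seq0"], True)
    method_name, args = read_shm(shm)
finally:
    shm.close()
    shm.unlink()
```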

Prefill vs Decode: Two Different Paths

The run() method is the entry point:

def run(self, seqs: list[Sequence], is_prefill: bool) -> list[int]:
    # Prepare batch data
    if is_prefill:
        input_ids, positions = self.prepare_prefill(seqs)
    else:
        input_ids, positions = self.prepare_decode(seqs)
    temperatures = self.prepare_sample(seqs) if self.rank == 0 else None
    
    # Execute model
    logits = self.run_model(input_ids, positions, is_prefill)
    
    # Sample tokens (rank 0 only)
    token_ids = self.sampler(logits, temperatures).tolist() if self.rank == 0 else None
    
    return token_ids

Prefill Preparation

During prefill, you have multiple sequences with varying lengths. The goal is to pack all uncached tokens into one batch:

for seq in seqs:
    seqlen = len(seq)
    seqlen_q = seqlen - seq.num_cached_tokens  # uncached (new) tokens only
    seqlen_k = seqlen                          # keys cover the full sequence
    
    # Extract NEW tokens (everything past the prefix-cache hit)
    input_ids.extend(seq[seq.num_cached_tokens:])
    
    # Generate REAL positions
    positions.extend(list(range(seq.num_cached_tokens, seqlen)))
    
    # Build cumulative lengths for packed batching
    cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)
    cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen_k)
    
    # Map uncached blocks to KV cache slots
    for i in range(seq.num_cached_blocks, seq.num_blocks):
        start = seq.block_table[i] * self.block_size
        if i != seq.num_blocks - 1:
            end = start + self.block_size
        else:
            end = start + seq.last_block_num_tokens  # last block may be partial
        slot_mapping.extend(list(range(start, end)))

The key insight: prefill uses packed-ragged batching. Tokens from different sequences are concatenated, and cu_seqlens_q tells the attention kernel where each sequence's tokens begin and end.
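Concretely, for three sequences with 3, 5, and 2 uncached tokens, the packed batch has 10 rows and the cumulative-length array marks the boundaries. A minimal illustration of the bookkeeping, not nanovllm's exact code:

```python
seq_lens = [3, 5, 2]      # uncached tokens contributed by each sequence
cu_seqlens = [0]
for n in seq_lens:
    cu_seqlens.append(cu_seqlens[-1] + n)

# cu_seqlens == [0, 3, 8, 10]:
# sequence i owns packed rows cu_seqlens[i] : cu_seqlens[i+1],
# so the attention kernel never mixes tokens across sequences.
```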

Decode Preparation

Decode is simpler because each sequence contributes exactly one token:

for seq in seqs:
    # Each sequence contributes exactly ONE token
    input_ids.append(seq.last_token)
    positions.append(len(seq) - 1)  # Last position in sequence
    context_lens.append(len(seq))   # Attend to all previous tokens
    
    # Calculate KV slot for NEW token
    slot = seq.block_table[-1] * self.block_size + seq.last_block_num_tokens - 1
    slot_mapping.append(slot)

Decode uses fixed one-token-per-sequence batching. This structure is what makes CUDA graph optimization possible.
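The slot arithmetic deserves a worked example. With a block size of 16, a sequence whose last physical block is block 7 and currently holds 5 tokens writes its newest KV entry at slot 7*16 + 5 - 1 = 116. A sketch using the same names as the snippet above:

```python
def decode_slot(block_table: list[int], block_size: int,
                last_block_num_tokens: int) -> int:
    # The newest token lives in the sequence's last physical block,
    # at offset (tokens currently in that block - 1).
    return block_table[-1] * block_size + last_block_num_tokens - 1

slot = decode_slot([2, 9, 7], block_size=16, last_block_num_tokens=5)  # 116
```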

Preparing Block Tables for GPU

Block tables need to be formatted for GPU consumption. This happens in prepare_block_tables():

max_len = max(len(seq.block_table) for seq in seqs)
block_tables = []
for seq in seqs:
    padded = seq.block_table + [-1] * (max_len - len(seq.block_table))
    block_tables.append(padded)
return torch.tensor(block_tables, dtype=torch.int32)

Why padding with -1? The GPU kernel checks for -1 and skips invalid blocks. This allows all sequences to have the same tensor shape, which GPUs love.
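A minimal, framework-free version of the padding step (illustrative helper, not nanovllm's exact function):

```python
def pad_block_tables(block_tables: list[list[int]], pad: int = -1) -> list[list[int]]:
    """Right-pad every block table to the batch maximum so they stack
    into one rectangular tensor; pad entries mark slots the kernel skips."""
    max_len = max(len(bt) for bt in block_tables)
    return [bt + [pad] * (max_len - len(bt)) for bt in block_tables]

padded = pad_block_tables([[3, 7], [5], [1, 4, 9]])
# padded == [[3, 7, -1], [5, -1, -1], [1, 4, 9]]
```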

Running the Model: Eager vs CUDA Graph

The run_model() method chooses between two execution paths:

def run_model(self, input_ids, positions, is_prefill):
    if is_prefill or self.enforce_eager or input_ids.size(0) > 512:
        # Eager execution: launch kernels one by one
        return self.model.compute_logits(self.model(input_ids, positions))
    else:
        # CUDA graph: replay pre-recorded graph
        return self.run_cudagraph(input_ids, positions)

Prefill always uses eager execution because the batch structure changes every step (different sequences, different lengths).

Decode can use CUDA graphs because the batch structure is fixed (one token per sequence).
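Because graphs are captured only at a few fixed batch sizes, replay rounds the actual batch size up to the nearest captured bucket; the unused tail rows are padding whose outputs are discarded. A sketch of that lookup (the bucket list here is an assumption matching the capture loop shown later):

```python
import bisect

GRAPH_BS = [1, 2, 4, 8, 16, 32]  # captured batch sizes, ascending

def pick_graph_bs(actual_bs: int) -> int:
    """Smallest captured batch size that can hold the actual batch."""
    i = bisect.bisect_left(GRAPH_BS, actual_bs)
    if i == len(GRAPH_BS):
        raise ValueError("batch too large for any captured graph; use eager mode")
    return GRAPH_BS[i]

chosen = pick_graph_bs(3)  # a 3-sequence batch replays the bs=4 graph
```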

CUDA Graphs: Pre-Recording GPU Work

CUDA graphs are one of the most powerful optimizations in inference. Instead of launching kernels one by one (which has CPU overhead), you record the entire forward pass once, then replay it.

During initialization, capture_cudagraph() runs:

# input_ids, positions, outputs are pre-allocated max-batch-size buffers
for bs in reversed(self.graph_bs):  # e.g. [1, 2, 4, 8, 16, 32], captured largest-first
    graph = torch.cuda.CUDAGraph()
    
    # Warmup run to stabilize memory
    outputs[:bs] = self.model(input_ids[:bs], positions[:bs])
    
    # Record the forward pass
    with torch.cuda.graph(graph, self.graph_pool):
        outputs[:bs] = self.model(input_ids[:bs], positions[:bs])
    
    # Save for later replay
    self.graphs[bs] = graph

The key insight: CUDA graphs record tensor pointers, not values. So you can update the input tensors and replay the graph with new data:

# At inference time:
bs = actual_batch_size
self.graph_vars['input_ids'][:bs] = new_tokens
self.graph_vars['positions'][:bs] = new_positions
self.graphs[bs].replay()  # Ultra-fast

This is why decode steps carry so little CPU overhead: instead of launching hundreds of kernels one by one, the CPU just updates a few input tensors and replays a pre-recorded graph.

The Complete Execution Flow

Here's how everything fits together:

LLMEngine.step()
    ↓
Scheduler.schedule() → returns (seqs, is_prefill)
    ↓
ModelRunner.call("run", seqs, is_prefill)
    ├─ Rank 0: write_shm() → broadcast to workers
    └─ All ranks: execute run()
        ├─ prepare_prefill() or prepare_decode()
        ├─ prepare_block_tables()
        ├─ run_model() → eager or CUDA graph
        └─ sampler() → next tokens (rank 0 only)
    ↓
Scheduler.postprocess() → update sequences

Why This Design?

The ModelRunner's complexity exists for good reasons:

  1. Multi-GPU efficiency: Shared memory avoids Python GIL overhead, NCCL handles tensor communication
  2. Memory optimization: Pre-allocated KV cache, paged attention, block reuse
  3. Execution optimization: CUDA graphs eliminate launch overhead for decode
  4. Batching flexibility: Packed-ragged batching for prefill, fixed batching for decode

Understanding ModelRunner is understanding how modern inference engines achieve their speed. It's not magic—it's careful engineering at every level.