Scheduling and Continuous Batching: Keeping the GPU Busy
In a naive inference system, you'd wait for one sequence to finish generating before starting the next one. The GPU sits idle between requests and is underutilized during single-sequence decode. Modern inference engines use continuous batching to keep the GPU constantly busy by mixing prefill and decode work from many requests.
The Scheduling Problem
Imagine you have three requests:
- Request A: 10-token prompt, want 100 tokens
- Request B: 50-token prompt, want 50 tokens
- Request C: 5-token prompt, want 200 tokens
Naive approach:
Process A's prefill (10 tokens)
Process A's decode (100 tokens, one at a time)
Process B's prefill (50 tokens)
Process B's decode (50 tokens, one at a time)
Process C's prefill (5 tokens)
Process C's decode (200 tokens, one at a time)
The GPU is idle between requests. Worse, during decode, you're processing one sequence at a time, wasting GPU parallelism.
Continuous Batching: The Solution
Continuous batching interleaves prefill and decode work:
Step 1: Prefill A (10 tokens) + Prefill B (50 tokens) = 60 tokens
Step 2: Decode A (1 token) + Decode B (1 token) + Prefill C (5 tokens) = 7 tokens
Step 3: Decode A (1 token) + Decode B (1 token) + Decode C (1 token) = 3 tokens
Step 4: Decode A (1 token) + Decode C (1 token) = 2 tokens (B finished)
Step 5: Decode A (1 token) + Decode C (1 token) = 2 tokens
...
The GPU is constantly busy, and you're batching multiple sequences together.
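The step trace above can be reproduced with a toy simulator. Everything here is illustrative: the 60-token budget is chosen so the output matches the example, and real engines use far larger budgets.

```python
# Toy simulation of continuous batching. Each request is
# (prompt_length, tokens_to_generate).
def simulate(requests, max_batched_tokens=60):
    """Return the number of tokens processed at each scheduler step."""
    waiting = list(requests.items())   # requests not yet prefilled
    running = {}                       # name -> decode tokens still owed
    steps = []
    while waiting or running:
        tokens = 0
        decoders = list(running)       # only sequences prefilled in earlier steps decode
        # Admit new requests (prefill) while the token budget allows
        while waiting and tokens + waiting[0][1][0] <= max_batched_tokens:
            name, (prompt_len, want) = waiting.pop(0)
            running[name] = want
            tokens += prompt_len
        # One decode token per already-running sequence
        for name in decoders:
            tokens += 1
            running[name] -= 1
            if running[name] == 0:
                del running[name]      # sequence finished
        steps.append(tokens)
    return steps

# A(10-token prompt, 100 wanted), B(50, 50), C(5, 200) as above
steps = simulate({"A": (10, 100), "B": (50, 50), "C": (5, 200)})
print(steps[:5])  # [60, 7, 3, 3, 3]
```

The per-step token counts match the trace: 60 (prefill A+B), 7 (prefill C plus two decodes), then 3 per step until B finishes.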
How the Scheduler Works
The Scheduler manages three queues:
```python
self.waiting = deque()   # New requests waiting to start
self.running = deque()   # Sequences currently being processed
self.finished = deque()  # Completed sequences
```
The schedule() method is called every step:
```python
def schedule(self):
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    num_prefill_seqs = 0
    # Snapshot the sequences that were already running: sequences admitted
    # below are prefilled this step and must not also be scheduled for decode
    decode_candidates = list(self.running)

    # Phase 1: Try to prefill new sequences
    while self.waiting and num_seqs < self.max_num_seqs:
        seq = self.waiting[0]
        # Check if we have space in the token budget
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens:
            break  # No space for this sequence
        # Allocate KV cache blocks
        if not self.block_manager.allocate(seq):
            break  # Out of memory
        # Move to running
        seq.status = SequenceStatus.RUNNING
        self.waiting.popleft()
        self.running.append(seq)
        scheduled_seqs.append(seq)
        num_batched_tokens += len(seq)
        num_seqs += 1
        num_prefill_seqs += 1

    # Phase 2: Add decode work from already-running sequences
    for seq in decode_candidates:
        if num_seqs >= self.max_num_seqs:
            break
        if num_batched_tokens + 1 > self.max_num_batched_tokens:
            break
        scheduled_seqs.append(seq)
        num_batched_tokens += 1
        num_seqs += 1

    # The batch counts as prefill if any new sequence was admitted this step
    # (checking seq.status here would fail: it was already set to RUNNING above)
    is_prefill = num_prefill_seqs > 0
    return scheduled_seqs, is_prefill
```
Prefill vs Decode Scheduling
The scheduler treats prefill and decode differently:
Prefill Phase:
- New sequences from the waiting queue
- All uncached tokens are processed
- Example: 10-token prompt → 10 tokens processed
Decode Phase:
- Sequences already in the running queue
- Exactly one token per sequence
- Example: 32 sequences → 32 tokens processed
The key insight: Prefill is latency-critical, decode is throughput-critical.
Prefill should be done as quickly as possible to get the sequence into the running queue. Decode should be batched as much as possible to maximize GPU utilization.
Token Budget: The Constraint
The scheduler respects two constraints:
- max_num_seqs: Maximum number of sequences in a batch
- max_num_batched_tokens: Maximum total tokens per batch
For example:
- max_num_seqs = 256
- max_num_batched_tokens = 8192
This means you can have at most 256 sequences, but the total tokens across all sequences can't exceed 8192.
Why the token budget? GPU memory is limited. Each token needs space in:
- Input embeddings
- Attention Q, K, V
- MLP activations
- KV cache slots
The token budget ensures you don't run out of memory.
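To make the budget concrete, here is a back-of-the-envelope calculation of KV-cache memory per token. The model shape used below (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical example, not a configuration taken from the text:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Each token stores one key and one value vector per layer per KV head
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                  # 131072 bytes = 128 KiB per token
print(per_token * 8192 // 2**30)  # a full 8192-token budget holds 1 GiB of KV cache
```

At 128 KiB per token, an 8192-token budget pins down a gigabyte of KV cache before counting activations, which is why the scheduler enforces the budget rather than trusting batches to fit.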
Memory Management: Block Manager
The Block Manager allocates KV cache blocks to sequences:
```python
def allocate(self, seq):
    # Calculate blocks needed (ceiling division)
    num_blocks_needed = (len(seq) + self.block_size - 1) // self.block_size
    # Check if we have free blocks
    if len(self.free_blocks) < num_blocks_needed:
        return False  # Out of memory
    # Allocate blocks
    for _ in range(num_blocks_needed):
        block_id = self.free_blocks.pop()
        seq.block_table.append(block_id)
    return True
```
When a sequence finishes, its blocks are deallocated:
```python
def deallocate(self, seq):
    for block_id in seq.block_table:
        self.free_blocks.append(block_id)
    seq.block_table.clear()
```
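To see allocation and deallocation working together, here is a self-contained toy version of the allocator. The class names and pool size are illustrative, not the engine's actual API:

```python
class Seq:
    """Minimal stand-in for a sequence: a length plus a block table."""
    def __init__(self, num_tokens):
        self.num_tokens = num_tokens
        self.block_table = []
    def __len__(self):
        return self.num_tokens

class ToyBlockManager:
    """Fixed pool of KV cache blocks, handed out block by block."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
    def allocate(self, seq):
        needed = (len(seq) + self.block_size - 1) // self.block_size
        if len(self.free_blocks) < needed:
            return False  # out of memory
        for _ in range(needed):
            seq.block_table.append(self.free_blocks.pop())
        return True
    def deallocate(self, seq):
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()

bm = ToyBlockManager(num_blocks=4, block_size=16)
a = Seq(40)
print(bm.allocate(a), len(a.block_table))  # True 3  (ceil(40/16) = 3 blocks)
b = Seq(33)
print(bm.allocate(b))                      # False   (needs 3, only 1 free)
bm.deallocate(a)
print(bm.allocate(b))                      # True    (all 4 blocks were freed)
```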
Chunked Prefill: Processing Large Prompts
What if a prompt is 10,000 tokens? You can't process it all at once—it would exceed the token budget.
Chunked prefill breaks it into smaller chunks:
```python
# Prompt: 10,000 tokens
# max_num_batched_tokens: 8192

# Step 1: Process first 8192 tokens
# Step 2: Process next 1808 tokens
```
The sequence stays in the running queue between chunks. This allows other sequences to be processed while you're chunking the large prompt.
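The chunk boundaries are simple arithmetic. A sketch (in practice each step's chunk also shares the budget with that step's decode tokens, which shrinks it further):

```python
def prefill_chunks(prompt_len, max_num_batched_tokens):
    """Split a long prompt into budget-sized prefill chunks."""
    chunks, done = [], 0
    while done < prompt_len:
        chunk = min(max_num_batched_tokens, prompt_len - done)
        chunks.append(chunk)
        done += chunk
    return chunks

print(prefill_chunks(10_000, 8192))  # [8192, 1808]
```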
Preemption: When Memory Runs Out
Sometimes you can't fit a new sequence even after deallocating finished ones. The scheduler can preempt running sequences:
```python
def preempt(self, seq):
    # Move sequence back to waiting queue
    seq.status = SequenceStatus.WAITING
    self.running.remove(seq)
    self.waiting.appendleft(seq)  # High priority: retry it first
    # Deallocate its KV cache blocks
    self.block_manager.deallocate(seq)
```
The sequence loses its KV cache and has to be re-prefilled later. This is a tradeoff: you free memory now, but you'll have to recompute the prefill later.
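One common victim policy (assumed here for illustration, not spelled out above) is to preempt from the back of the running queue, i.e. the most recently admitted sequences. A sketch with stub classes:

```python
from collections import deque

class StubSeq:
    """Stand-in sequence: a name, its KV block count, and a status."""
    def __init__(self, name, blocks):
        self.name, self.blocks, self.status = name, blocks, "RUNNING"

class StubBlockManager:
    """Stand-in allocator that only counts free blocks."""
    def __init__(self, free):
        self.free = free
    def allocate(self, seq):
        if self.free < seq.blocks:
            return False
        self.free -= seq.blocks
        return True
    def deallocate(self, seq):
        self.free += seq.blocks

def admit_with_preemption(new_seq, running, waiting, bm):
    """Allocate for new_seq, preempting from the back of `running` if needed."""
    while not bm.allocate(new_seq):
        if not running:
            return False  # nothing left to preempt; caller retries later
        victim = running.pop()        # most recently admitted sequence
        bm.deallocate(victim)         # its KV cache is discarded...
        victim.status = "WAITING"
        waiting.appendleft(victim)    # ...so it must re-prefill, soon
    return True

running = deque([StubSeq("A", 2), StubSeq("B", 3)])
waiting = deque()
bm = StubBlockManager(free=1)
print(admit_with_preemption(StubSeq("C", 3), running, waiting, bm))  # True
print([s.name for s in waiting])  # ['B'] -- B was preempted to make room
```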
Post-Processing: Updating Sequences
After ModelRunner returns the next tokens, the scheduler updates sequences:
```python
def postprocess(self, seqs, token_ids):
    for seq, token_id in zip(seqs, token_ids):
        # Add new token
        seq.append_token(token_id)
        seq.num_cached_tokens = len(seq)
        # Check if finished
        if token_id == self.eos or seq.num_completion_tokens >= seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
            self.running.remove(seq)
            self.finished.append(seq)
```
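The two stop conditions can be factored into a small predicate. A sketch with assumed parameter names:

```python
def is_finished(token_id, eos_token_id, num_completion_tokens, max_tokens):
    """A sequence stops when it emits EOS or hits its completion-length cap."""
    return token_id == eos_token_id or num_completion_tokens >= max_tokens

print(is_finished(2, eos_token_id=2, num_completion_tokens=5, max_tokens=100))    # True (EOS)
print(is_finished(7, eos_token_id=2, num_completion_tokens=100, max_tokens=100))  # True (cap)
print(is_finished(7, eos_token_id=2, num_completion_tokens=5, max_tokens=100))    # False
```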
The Complete Scheduling Loop
Here's how everything fits together:
```
LLMEngine.step()
  ↓
Scheduler.schedule()
  ├─ Try to prefill new sequences from waiting queue
  ├─ Fill remaining capacity with decode from running queue
  └─ Return (scheduled_seqs, is_prefill)
  ↓
ModelRunner.run(scheduled_seqs, is_prefill)
  ├─ prepare_prefill() or prepare_decode()
  ├─ run_model()
  └─ Return token_ids
  ↓
Scheduler.postprocess(scheduled_seqs, token_ids)
  ├─ Update sequences with new tokens
  ├─ Check if finished
  └─ Deallocate finished sequences
  ↓
Repeat until all sequences finished
```
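That loop can be written compactly. This sketch assumes the method names from the diagram, plus an `is_finished()` on the scheduler and a `postprocess()` that returns the sequences finished this step; those return conventions are assumptions for illustration:

```python
def generate(scheduler, model_runner):
    """Drive scheduler steps until every sequence completes (sketch)."""
    outputs = {}
    while not scheduler.is_finished():
        seqs, is_prefill = scheduler.schedule()
        token_ids = model_runner.run(seqs, is_prefill)
        finished = scheduler.postprocess(seqs, token_ids)
        for seq in finished:
            outputs[seq.seq_id] = seq.token_ids
    return outputs
```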
Why This Design?
Continuous batching achieves several goals:
- GPU utilization: The GPU is always busy, never idle
- Fairness: New requests start quickly (prefill is prioritized)
- Memory efficiency: Token budget prevents OOM
- Flexibility: Sequences can be added/removed dynamically
The scheduler is the orchestrator that balances these concerns.
Real-World Example
Let's trace a concrete example:
Initial state:
- waiting: [A(10), B(50), C(5)]
- running: []
- free_blocks: [0, 1, 2, ..., 1319]
Step 1: schedule()
- Prefill A (10 tokens) + B (50 tokens) = 60 tokens
- Allocate blocks for A and B
- scheduled_seqs = [A, B], is_prefill = True
Step 1: run_model()
- Process 60 tokens through model
- Return token_ids = [next_token_A, next_token_B]
Step 1: postprocess()
- A: append token, num_cached_tokens = 11
- B: append token, num_cached_tokens = 51
- A and B are now in the running queue
Step 2: schedule()
- Prefill C (5 tokens)
- Decode A (1 token) + B (1 token)
- Total: 5 + 1 + 1 = 7 tokens
- scheduled_seqs = [C, A, B], is_prefill = True (C is prefilling)
Step 2: run_model()
- Process 7 tokens (5 prefill + 2 decode)
- Return token_ids = [next_token_C, next_token_A, next_token_B]
Step 2: postprocess()
- C: append token, num_cached_tokens = 6
- A: append token, num_cached_tokens = 12
- B: append token, num_cached_tokens = 52
- All three in running queue
Step 3: schedule()
- Decode A (1 token) + B (1 token) + C (1 token)
- Total: 3 tokens
- scheduled_seqs = [A, B, C], is_prefill = False
...continue until all sequences finish...
This is continuous batching in action: prefill and decode are interleaved, the GPU is always busy, and new requests start quickly.