Scheduling and Continuous Batching: Keeping the GPU Busy

In a naive inference system, you'd wait for one sequence to finish generating before starting the next one. This leaves the GPU idle while waiting for I/O or while processing short sequences. Modern inference engines use continuous batching to keep the GPU constantly busy by mixing prefill and decode work.

The Scheduling Problem

Imagine you have three requests:

Request A: 10-token prompt, 100 tokens to generate
Request B: 50-token prompt, 50 tokens to generate
Request C: 5-token prompt, 200 tokens to generate

Naive approach:

Process A's prefill (10 tokens)
Process A's decode (100 tokens, one at a time)
Process B's prefill (50 tokens)
Process B's decode (50 tokens, one at a time)
Process C's prefill (5 tokens)
Process C's decode (200 tokens, one at a time)

The GPU is idle between requests. Worse, during decode, you're processing one sequence at a time, wasting GPU parallelism.

Continuous Batching: The Solution

Continuous batching interleaves prefill and decode work:

Step 1: Prefill A (10 tokens) + Prefill B (50 tokens) = 60 tokens
Step 2: Decode A (1 token) + Decode B (1 token) + Prefill C (5 tokens) = 7 tokens
Step 3: Decode A (1 token) + Decode B (1 token) + Decode C (1 token) = 3 tokens
...
Step 51: Decode A (1 token) + Decode C (1 token) = 2 tokens (B has finished)
Step 52: Decode A (1 token) + Decode C (1 token) = 2 tokens
...

The GPU is constantly busy, and you're batching multiple sequences together.

How the Scheduler Works

The Scheduler manages three queues:

self.waiting = deque()   # New requests waiting to start
self.running = deque()   # Sequences currently being processed
self.finished = deque()  # Completed sequences

The schedule() method is called every step:

def schedule(self):
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    num_prefill = 0
    
    # Phase 1: Try to prefill new sequences
    while self.waiting and num_seqs < self.max_num_seqs:
        seq = self.waiting[0]
        
        # Check the token budget: the whole prompt counts
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens:
            break  # No space for this sequence
        
        # Allocate KV cache blocks
        if not self.block_manager.allocate(seq):
            break  # Out of memory
        
        # Move to running
        seq.status = SequenceStatus.RUNNING
        self.waiting.popleft()
        self.running.append(seq)
        scheduled_seqs.append(seq)
        num_batched_tokens += len(seq)
        num_seqs += 1
        num_prefill += 1
    
    # Phase 2: Add decode work from running sequences
    for seq in self.running:
        if seq in scheduled_seqs:
            continue  # already scheduled for prefill this step
        if num_seqs >= self.max_num_seqs:
            break
        if num_batched_tokens + 1 > self.max_num_batched_tokens:
            break
        
        scheduled_seqs.append(seq)
        num_batched_tokens += 1
        num_seqs += 1
    
    # The batch counts as prefill if any new sequence was admitted
    return scheduled_seqs, num_prefill > 0
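The two phases can be exercised with a toy, self-contained version of the scheduler. `Seq` here is a hypothetical stand-in for the real sequence class, holding just a name and a token count:

```python
from collections import deque

class Seq:
    """Hypothetical stand-in for the real sequence class."""
    def __init__(self, name, prompt_len):
        self.name = name
        self.num_tokens = prompt_len
    def __len__(self):
        return self.num_tokens

def schedule(waiting, running, max_num_seqs=256, max_num_batched_tokens=8192):
    scheduled, num_tokens, num_prefill = [], 0, 0
    # Phase 1: admit new sequences; the whole prompt counts against the budget
    while waiting and len(scheduled) < max_num_seqs:
        if num_tokens + len(waiting[0]) > max_num_batched_tokens:
            break
        seq = waiting.popleft()
        running.append(seq)
        scheduled.append(seq)
        num_tokens += len(seq)
        num_prefill += 1
    # Phase 2: one decode token for each sequence that was already running
    for seq in running:
        if seq in scheduled or len(scheduled) >= max_num_seqs:
            continue
        if num_tokens + 1 > max_num_batched_tokens:
            break
        scheduled.append(seq)
        num_tokens += 1
    return scheduled, num_prefill > 0

waiting, running = deque([Seq("A", 10), Seq("B", 50)]), deque()
batch, is_prefill = schedule(waiting, running)
print([s.name for s in batch], is_prefill)   # ['A', 'B'] True

waiting.append(Seq("C", 5))
batch, is_prefill = schedule(waiting, running)
print([s.name for s in batch], is_prefill)   # ['C', 'A', 'B'] True
```

The second call mirrors Step 2 from the earlier timeline: C's prefill rides in the same batch as decode work for A and B.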

Prefill vs Decode Scheduling

The scheduler treats prefill and decode differently:

Prefill Phase: the whole prompt is processed in a single forward pass, so one sequence can contribute many tokens to the batch. This phase is compute-bound.

Decode Phase: each sequence contributes exactly one new token per step, so many sequences must be batched together to keep the GPU busy. This phase is memory-bandwidth-bound.

The key insight: Prefill is latency-critical, decode is throughput-critical.

Prefill should run as soon as possible so the sequence starts producing output. Decode should be batched as widely as possible to maximize GPU utilization.
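A rough back-of-envelope, using assumed numbers (a 7B-parameter model in fp16 and ~1 TB/s of memory bandwidth), shows why decode wants wide batches:

```python
# Rough arithmetic with assumed numbers: every decode step must stream
# all model weights from GPU memory, no matter how many sequences are
# in the batch, so batching amortizes that fixed cost.
weight_bytes = 7e9 * 2      # assumed: 7B parameters in fp16
bandwidth = 1e12            # assumed: ~1 TB/s memory bandwidth
step_time = weight_bytes / bandwidth   # seconds per decode step

for batch_size in (1, 8, 64):
    tokens_per_sec = batch_size / step_time
    print(batch_size, int(tokens_per_sec))
```

The step time stays roughly constant while throughput scales with batch size, which is exactly why the scheduler packs as much decode work as it can into each step.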

Token Budget: The Constraint

The scheduler respects two constraints:

  1. max_num_seqs: Maximum number of sequences in a batch
  2. max_num_batched_tokens: Maximum total tokens per batch

For example:

max_num_seqs = 256
max_num_batched_tokens = 8192

This means you can have at most 256 sequences in a batch, and the total tokens across all of them can't exceed 8192.

Why the token budget? GPU memory is limited. Each token needs space in:

The KV cache: one key vector and one value vector per attention layer
The intermediate activations of the forward pass that processes it

The token budget ensures you don't run out of memory.
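To make the memory pressure concrete, here's a sizing sketch with assumed Llama-7B-like shapes (32 layers, 32 heads, head dim 128, fp16); all numbers are illustrative:

```python
# Assumed model shapes (Llama-7B-like); all numbers are illustrative.
num_layers, num_heads, head_dim = 32, 32, 128
bytes_per_value = 2   # fp16

# Each token stores one key and one value vector per attention layer.
kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
print(kv_bytes_per_token // 1024)          # KiB of KV cache per token
print(kv_bytes_per_token * 8192 // 2**30)  # GiB for a full 8192-token budget
```

With these shapes, every token costs about 512 KiB of KV cache, so a full 8192-token budget pins roughly 4 GiB of GPU memory before counting weights or activations.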

Memory Management: Block Manager

The Block Manager allocates KV cache blocks to sequences:

def allocate(self, seq):
    # Calculate blocks needed
    num_blocks_needed = (len(seq) + self.block_size - 1) // self.block_size
    
    # Check if we have free blocks
    if len(self.free_blocks) < num_blocks_needed:
        return False  # Out of memory
    
    # Allocate blocks
    for _ in range(num_blocks_needed):
        block_id = self.free_blocks.pop()
        seq.block_table.append(block_id)
    
    return True

When a sequence finishes, its blocks are deallocated:

def deallocate(self, seq):
    for block_id in seq.block_table:
        self.free_blocks.append(block_id)
    seq.block_table.clear()
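A self-contained toy version of the block manager shows the allocate/deallocate cycle end to end. `Seq` here is a hypothetical stand-in holding just a length and a block table:

```python
class Seq:
    """Hypothetical stand-in: a length and a block table."""
    def __init__(self, n):
        self.n = n
        self.block_table = []
    def __len__(self):
        return self.n

class BlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
    def allocate(self, seq):
        # Round up: a partially filled block still occupies a whole block
        needed = (len(seq) + self.block_size - 1) // self.block_size
        if len(self.free_blocks) < needed:
            return False
        for _ in range(needed):
            seq.block_table.append(self.free_blocks.pop())
        return True
    def deallocate(self, seq):
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()

bm = BlockManager(num_blocks=4, block_size=16)
a, b = Seq(50), Seq(40)
print(bm.allocate(a), len(a.block_table))  # True 4   (ceil(50/16) = 4)
print(bm.allocate(b))                      # False -- no free blocks left
bm.deallocate(a)
print(bm.allocate(b), len(b.block_table))  # True 3   (ceil(40/16) = 3)
```

The second `allocate` failing is exactly the situation the scheduler's Phase 1 `break` handles: the new sequence stays in the waiting queue until blocks free up.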

Chunked Prefill: Processing Large Prompts

What if a prompt is 10,000 tokens? You can't process it all at once—it would exceed the token budget.

Chunked prefill breaks it into smaller chunks:

# Prompt: 10,000 tokens
# max_num_batched_tokens: 8192

# Step 1: Process first 8192 tokens
# Step 2: Process next 1808 tokens

The sequence stays in the running queue between chunks. This allows other sequences to be processed while you're chunking the large prompt.
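The chunking arithmetic can be sketched as a small generator that splits a prompt into budget-sized pieces, tracking how many tokens are already in the cache (names here are illustrative, not the engine's real API):

```python
def prefill_chunks(prompt_len, budget):
    """Yield (tokens_already_cached, tokens_this_step) pairs."""
    done = 0
    while done < prompt_len:
        chunk = min(budget, prompt_len - done)
        yield done, chunk
        done += chunk

print(list(prefill_chunks(10_000, 8192)))  # [(0, 8192), (8192, 1808)]
```

Each chunk attends to all previously cached tokens, so the KV cache built by earlier chunks is what makes the later chunks correct.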

Preemption: When Memory Runs Out

Sometimes you can't fit a new sequence even after deallocating finished ones. The scheduler can preempt running sequences:

def preempt(self, seq):
    # Move sequence back to waiting queue
    seq.status = SequenceStatus.WAITING
    self.running.remove(seq)
    self.waiting.appendleft(seq)  # High priority
    
    # Deallocate its KV cache blocks
    self.block_manager.deallocate(seq)

The sequence loses its KV cache and has to be re-prefilled later. This is a tradeoff: you free memory now, but you'll have to recompute the prefill later.
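The queue bookkeeping is worth seeing in isolation. One common victim choice (assumed here) is the most recently admitted running sequence; it goes to the front of the waiting queue so it resumes first:

```python
from collections import deque

# Sketch of preemption bookkeeping. The victim returns to the FRONT of
# the waiting queue, but its KV cache is gone, so its next prefill must
# recompute everything, including tokens it already generated.
waiting = deque(["D", "E"])          # new requests, in arrival order
running = deque(["A", "B", "C"])

victim = running.pop()               # assumed policy: preempt the newest, C
waiting.appendleft(victim)           # C is first in line to come back
print(list(waiting), list(running))  # ['C', 'D', 'E'] ['A', 'B']
```

`appendleft` is what makes preemption fair: the preempted sequence jumps ahead of requests that arrived after it.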

Post-Processing: Updating Sequences

After ModelRunner returns the next tokens, the scheduler updates sequences:

def postprocess(self, seqs, token_ids):
    for seq, token_id in zip(seqs, token_ids):
        # Add new token
        seq.append_token(token_id)
        seq.num_cached_tokens = len(seq)
        
        # Check if finished: EOS sampled or length limit reached
        if token_id == self.eos or seq.num_completion_tokens >= seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
            self.running.remove(seq)
            self.finished.append(seq)

The Complete Scheduling Loop

Here's how everything fits together:

LLMEngine.step()
    ↓
Scheduler.schedule()
    ├─ Try to prefill new sequences from waiting queue
    ├─ Fill remaining capacity with decode from running queue
    └─ Return (scheduled_seqs, is_prefill)
    ↓
ModelRunner.run(scheduled_seqs, is_prefill)
    ├─ prepare_prefill() or prepare_decode()
    ├─ run_model()
    └─ Return token_ids
    ↓
Scheduler.postprocess(scheduled_seqs, token_ids)
    ├─ Update sequences with new tokens
    ├─ Check if finished
    └─ Deallocate finished sequences
    ↓
Repeat until all sequences finished
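The whole loop can be simulated with a toy engine that replaces the model with a dummy that always emits one token per scheduled sequence, just to watch sequences flow from waiting through running to finished (all names here are illustrative):

```python
from collections import deque

class Seq:
    """Hypothetical toy sequence: a prompt length and a generation budget."""
    def __init__(self, name, prompt_len, max_new):
        self.name, self.prompt_len, self.max_new = name, prompt_len, max_new
        self.generated = 0

def engine_step(waiting, running, finished, budget=8192):
    batch, tokens = [], 0
    # schedule(): prefill from waiting, then decode from running
    while waiting and tokens + waiting[0].prompt_len <= budget:
        seq = waiting.popleft()
        running.append(seq)
        batch.append(seq)
        tokens += seq.prompt_len
    for seq in running:
        if seq not in batch and tokens + 1 <= budget:
            batch.append(seq)
            tokens += 1
    # run_model() stand-in: each scheduled sequence yields one token,
    # then postprocess(): retire sequences that hit their limit
    for seq in batch:
        seq.generated += 1
        if seq.generated >= seq.max_new:
            running.remove(seq)
            finished.append(seq)

waiting = deque([Seq("A", 10, 3), Seq("B", 50, 2), Seq("C", 5, 4)])
running, finished = deque(), deque()
steps = 0
while waiting or running:
    engine_step(waiting, running, finished)
    steps += 1

print(steps, [s.name for s in finished])  # 4 ['B', 'A', 'C']
```

B finishes first despite arriving second because it has the smallest generation budget; the scheduler never had to wait for it before serving A and C.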

Why This Design?

Continuous batching achieves several goals:

  1. GPU utilization: The GPU is always busy, never idle
  2. Fairness: New requests start quickly (prefill is prioritized)
  3. Memory efficiency: Token budget prevents OOM
  4. Flexibility: Sequences can be added/removed dynamically

The scheduler is the orchestrator that balances these concerns.

Real-World Example

Let's trace a concrete example:

Initial state:
- waiting: [A(10), B(50), C(5)]
- running: []
- free_blocks: [0, 1, 2, ..., 1319]

Step 1: schedule()
- Prefill A (10 tokens) + B (50 tokens) = 60 tokens
- Allocate blocks for A and B
- scheduled_seqs = [A, B], is_prefill = True

Step 1: run_model()
- Process 60 tokens through model
- Return token_ids = [next_token_A, next_token_B]

Step 1: postprocess()
- A: append token, num_cached_tokens = 11
- B: append token, num_cached_tokens = 51
- A and B are now in the running queue

Step 2: schedule()
- Prefill C (5 tokens)
- Decode A (1 token) + B (1 token)
- Total: 5 + 1 + 1 = 7 tokens
- scheduled_seqs = [C, A, B], is_prefill = True (C is prefilling)

Step 2: run_model()
- Process 7 tokens (5 prefill + 2 decode)
- Return token_ids = [next_token_C, next_token_A, next_token_B]

Step 2: postprocess()
- C: append token, num_cached_tokens = 6
- A: append token, num_cached_tokens = 12
- B: append token, num_cached_tokens = 52
- All three in running queue

Step 3: schedule()
- Decode A (1 token) + B (1 token) + C (1 token)
- Total: 3 tokens
- scheduled_seqs = [A, B, C], is_prefill = False

...continue until all sequences finish...

This is continuous batching in action: prefill and decode are interleaved, the GPU is always busy, and new requests start quickly.