Large Language Model Inference & Deployment
Learn how modern inference engines manage the key-value (KV) cache efficiently using paged attention, block tables, and slot mapping.
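The core bookkeeping can be sketched in a few lines: a block table maps each sequence's logical blocks to physical cache blocks, and the slot mapping gives every token a flat index into physical memory. This is an illustrative toy (class and method names are hypothetical, not vLLM's API; real block sizes are larger, e.g. 16):

```python
BLOCK_SIZE = 4  # tokens per physical block (toy value; vLLM commonly uses 16)

class PagedKVCache:
    """Toy paged-KV bookkeeping: block table + slot mapping, no real tensors."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Allocate a block on each block boundary; return the token's physical slot."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                 # crossed into a new logical block
            table.append(self.free_blocks.pop())  # grab any free physical block
        block = table[pos // BLOCK_SIZE]          # block-table lookup
        return block * BLOCK_SIZE + pos % BLOCK_SIZE  # slot mapping

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0, pos=p) for p in range(6)]
# slots -> [28, 29, 30, 31, 24, 25]: physically scattered, logically contiguous
```

Because the block table indirects every lookup, physical blocks need not be contiguous, which is what lets the engine pack many sequences into one cache pool.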
Explore how ModelRunner coordinates multi-GPU execution, manages distributed communication, and optimizes with CUDA graphs.
Trace how tensors transform from token IDs through embeddings, attention, MLP layers, and finally to logits with concrete examples.
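The shape pipeline for one forward pass can be summarized with toy dimensions (illustrative numbers, not tied to any specific model):

```python
# Hypothetical shape trace for one decoder forward pass (toy dimensions).
T, D, V = 5, 4096, 32000   # sequence length, hidden dim, vocab size

token_ids  = (T,)      # integer token IDs in
embeddings = (T, D)    # after embedding-table lookup
attn_out   = (T, D)    # attention mixes across T but preserves (T, D)
mlp_out    = (T, D)    # MLP expands internally (e.g. 4*D) then projects back to D
logits     = (T, V)    # hidden states @ unembedding matrix of shape (D, V)
```

The key invariant is that every transformer block maps `(T, D)` to `(T, D)`; only the first step (embedding) and the last step (unembedding) change the trailing dimension.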
Understand how CUDA graphs eliminate kernel launch overhead and achieve 3-5x speedups in the decode phase.
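The capture-once, replay-many idea behind CUDA graphs can be mimicked with a stdlib toy. This is an analogy only: real CUDA graphs capture GPU kernel launches (e.g. via `torch.cuda.CUDAGraph`) into static buffers, so replay skips per-kernel CPU dispatch entirely.

```python
class ToyGraph:
    """Analogy for CUDA-graph capture/replay: record a fixed op sequence once,
    then rerun it against the same static buffer with a single replay() call."""
    def __init__(self):
        self.ops = []          # the captured "kernel" sequence
        self.buffer = [0.0]    # static buffer: replays reuse the same memory

    def capture(self, fns):
        self.ops = list(fns)   # one-time capture (the expensive step in real life)

    def replay(self):
        for fn in self.ops:    # fixed sequence, no per-op launch decisions
            self.buffer[0] = fn(self.buffer[0])
        return self.buffer[0]

g = ToyGraph()
g.capture([lambda x: x + 1, lambda x: x * 2])
out1 = g.replay()  # (0 + 1) * 2 = 2.0
out2 = g.replay()  # (2 + 1) * 2 = 6.0
```

The static-buffer constraint is why engines capture separate graphs per batch size and copy inputs into pre-allocated tensors before each replay.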
Discover how schedulers keep GPUs busy by interleaving prefill and decode work, managing memory, and handling preemption.
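One scheduling step can be sketched as a token-budget loop: running decodes get one token each, then waiting prefills are admitted into the leftover budget, chunked if they do not fit. This is an illustrative toy, not vLLM's `Scheduler` interface:

```python
from collections import deque

TOKEN_BUDGET = 8  # max tokens the model processes per step (toy value)

def schedule_step(decoding: deque, waiting: deque) -> list[tuple[str, int]]:
    """Return (seq_id, num_tokens) pairs for one mixed prefill/decode batch."""
    batch, budget = [], TOKEN_BUDGET
    for seq in list(decoding):          # decode requests: one token each
        if budget == 0:
            break
        batch.append((seq, 1))
        budget -= 1
    while waiting and budget > 0:       # admit prefills into the leftover budget
        seq, prompt_len = waiting[0]
        take = min(prompt_len, budget)  # chunked prefill if the prompt won't fit
        batch.append((seq, take))
        budget -= take
        if take == prompt_len:
            waiting.popleft()           # prompt fully processed: start decoding
            decoding.append(seq)
        else:
            waiting[0] = (seq, prompt_len - take)  # remainder waits for next step
    return batch

decoding = deque(["a", "b"])
waiting = deque([("c", 10)])
step1 = schedule_step(decoding, waiting)  # [('a', 1), ('b', 1), ('c', 6)]
```

Prioritizing decodes keeps in-flight requests streaming tokens; chunking prefills prevents one long prompt from stalling everyone else for a full step.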
Trace a complete inference request from arrival to completion, seeing how all components work together in practice.
Study a production-grade vLLM implementation featuring chunked prefill, mixed-batch execution, continuous batching, and prefix caching.
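Prefix caching hinges on content-addressed blocks: hashing each full block chained with its predecessor's hash means equal hashes imply an identical token prefix, so those KV blocks can be shared across requests. A minimal sketch (illustrative; the real scheme hashes token IDs plus the previous block hash, with extra keys for things like LoRA state):

```python
import hashlib

BLOCK = 4  # tokens per block (toy value)

def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hash per full block: hash i commits to the entire prefix."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(h)
        prev = h  # chain in the previous hash so position matters
    return hashes

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9])      # two full blocks + leftover
b = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 42, 0])  # same first two blocks
shared = sum(x == y for x, y in zip(a, b))          # leading blocks reusable
```

Only full blocks are hashed, and the chaining makes reuse safe: a matching hash at block *i* guarantees every earlier token matched too.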