Large Language Model Inference & Deployment
Learn how modern inference engines manage the key-value (KV) cache efficiently using paged attention, block tables, and slot mapping.
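The core bookkeeping can be sketched in a few lines: a block table maps each sequence's logical blocks to physical cache blocks, and the slot mapping gives every token a flat index into physical memory. This is an illustrative toy (class and method names are hypothetical, not vLLM's API; real block sizes are larger, e.g. 16):

```python
BLOCK_SIZE = 4  # tokens per physical block (toy value; vLLM commonly uses 16)

class PagedKVCache:
    """Toy paged-KV bookkeeping: block table + slot mapping, no real tensors."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Allocate a block on each block boundary; return the token's physical slot."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                 # crossed into a new logical block
            table.append(self.free_blocks.pop())  # grab any free physical block
        block = table[pos // BLOCK_SIZE]          # block-table lookup
        return block * BLOCK_SIZE + pos % BLOCK_SIZE  # slot mapping

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0, pos=p) for p in range(6)]
# slots -> [28, 29, 30, 31, 24, 25]: physically scattered, logically contiguous
```

Because the block table indirects every lookup, physical blocks need not be contiguous, which is what lets the engine pack many sequences into one cache pool.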
Explore how ModelRunner coordinates multi-GPU execution, manages distributed communication, and optimizes with CUDA graphs.
Trace how tensors transform from token IDs through embeddings, attention, MLP layers, and finally to logits with concrete examples.
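The shape pipeline for one forward pass can be summarized with toy dimensions (illustrative numbers, not tied to any specific model):

```python
# Hypothetical shape trace for one decoder forward pass (toy dimensions).
T, D, V = 5, 4096, 32000   # sequence length, hidden dim, vocab size

token_ids  = (T,)      # integer token IDs in
embeddings = (T, D)    # after embedding-table lookup
attn_out   = (T, D)    # attention mixes across T but preserves (T, D)
mlp_out    = (T, D)    # MLP expands internally (e.g. 4*D) then projects back to D
logits     = (T, V)    # hidden states @ unembedding matrix of shape (D, V)
```

The key invariant is that every transformer block maps `(T, D)` to `(T, D)`; only the first step (embedding) and the last step (unembedding) change the trailing dimension.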
Understand how CUDA graphs eliminate kernel launch overhead and achieve 3-5x speedups in the decode phase.
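The capture-once, replay-many idea behind CUDA graphs can be mimicked with a stdlib toy. This is an analogy only: real CUDA graphs capture GPU kernel launches (e.g. via `torch.cuda.CUDAGraph`) into static buffers, so replay skips per-kernel CPU dispatch entirely.

```python
class ToyGraph:
    """Analogy for CUDA-graph capture/replay: record a fixed op sequence once,
    then rerun it against the same static buffer with a single replay() call."""
    def __init__(self):
        self.ops = []          # the captured "kernel" sequence
        self.buffer = [0.0]    # static buffer: replays reuse the same memory

    def capture(self, fns):
        self.ops = list(fns)   # one-time capture (the expensive step in real life)

    def replay(self):
        for fn in self.ops:    # fixed sequence, no per-op launch decisions
            self.buffer[0] = fn(self.buffer[0])
        return self.buffer[0]

g = ToyGraph()
g.capture([lambda x: x + 1, lambda x: x * 2])
out1 = g.replay()  # (0 + 1) * 2 = 2.0
out2 = g.replay()  # (2 + 1) * 2 = 6.0
```

The static-buffer constraint is why engines capture separate graphs per batch size and copy inputs into pre-allocated tensors before each replay.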
Discover how schedulers keep GPUs busy by interleaving prefill and decode work, managing memory, and handling preemption.
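One scheduling step can be sketched as a token-budget loop: running decodes get one token each, then waiting prefills are admitted into the leftover budget, chunked if they do not fit. This is an illustrative toy, not vLLM's `Scheduler` interface:

```python
from collections import deque

TOKEN_BUDGET = 8  # max tokens the model processes per step (toy value)

def schedule_step(decoding: deque, waiting: deque) -> list[tuple[str, int]]:
    """Return (seq_id, num_tokens) pairs for one mixed prefill/decode batch."""
    batch, budget = [], TOKEN_BUDGET
    for seq in list(decoding):          # decode requests: one token each
        if budget == 0:
            break
        batch.append((seq, 1))
        budget -= 1
    while waiting and budget > 0:       # admit prefills into the leftover budget
        seq, prompt_len = waiting[0]
        take = min(prompt_len, budget)  # chunked prefill if the prompt won't fit
        batch.append((seq, take))
        budget -= take
        if take == prompt_len:
            waiting.popleft()           # prompt fully processed: start decoding
            decoding.append(seq)
        else:
            waiting[0] = (seq, prompt_len - take)  # remainder waits for next step
    return batch

decoding = deque(["a", "b"])
waiting = deque([("c", 10)])
step1 = schedule_step(decoding, waiting)  # [('a', 1), ('b', 1), ('c', 6)]
```

Prioritizing decodes keeps in-flight requests streaming tokens; chunking prefills prevents one long prompt from stalling everyone else for a full step.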
Trace a complete inference request from arrival to completion, seeing how all components work together in practice.
Study a production-grade vLLM implementation featuring chunked prefill, mixed-batch execution, continuous batching, and prefix caching.
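Prefix caching hinges on content-addressed blocks: hashing each full block chained with its predecessor's hash means equal hashes imply an identical token prefix, so those KV blocks can be shared across requests. A minimal sketch (illustrative; the real scheme hashes token IDs plus the previous block hash, with extra keys for things like LoRA state):

```python
import hashlib

BLOCK = 4  # tokens per block (toy value)

def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hash per full block: hash i commits to the entire prefix."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(h)
        prev = h  # chain in the previous hash so position matters
    return hashes

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9])      # two full blocks + leftover
b = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 42, 0])  # same first two blocks
shared = sum(x == y for x, y in zip(a, b))          # leading blocks reusable
```

Only full blocks are hashed, and the chaining makes reuse safe: a matching hash at block *i* guarantees every earlier token matched too.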