Large Language Model Inference & Deployment

Understanding KV Cache and Paged Attention

Learn how modern inference engines manage the key-value (KV) cache efficiently using paged attention, block tables, and slot mapping.
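
The core idea can be sketched in a few lines: a block table maps a sequence's logical token positions to physical cache slots, so blocks need not be contiguous. This is an illustrative model only (names, and the tiny block size, are made up; real engines typically use block sizes like 16), not the actual engine code:

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative; real engines often use 16)

def slot_mapping(block_table, num_tokens, block_size=BLOCK_SIZE):
    """Return the physical KV-cache slot for each logical token position."""
    slots = []
    for pos in range(num_tokens):
        block = block_table[pos // block_size]   # which physical block holds this position
        offset = pos % block_size                # offset inside that block
        slots.append(block * block_size + offset)
    return slots

# A 6-token sequence whose logical blocks 0 and 1 were allocated
# in physical blocks 7 and 2 of the cache:
print(slot_mapping([7, 2], 6))  # -> [28, 29, 30, 31, 8, 9]
```

Because the mapping is per-block rather than per-sequence, freed blocks can be reused by any sequence, which is what eliminates the fragmentation of contiguous KV-cache allocation.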

The Model Runner: Orchestrating GPU Inference

Explore how ModelRunner coordinates multi-GPU execution, manages distributed communication, and optimizes execution with CUDA graphs.

How Tensors Flow Through the Model

Trace how tensors transform from token IDs through embeddings, attention, MLP layers, and finally to logits with concrete examples.
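
The shape trace can be previewed with a toy example (dimensions and the tied unembedding are illustrative assumptions, not the article's exact model):

```python
import numpy as np

vocab, d_model = 50, 8                   # toy vocabulary and hidden size
ids = np.array([3, 7, 7, 1])             # token IDs, shape (seq,)

emb = np.random.randn(vocab, d_model)    # embedding table, (vocab, d_model)
x = emb[ids]                             # embeddings, (seq, d_model)
# ... attention and MLP layers transform x but preserve (seq, d_model) ...
logits = x @ emb.T                       # unembed to logits, (seq, vocab)

print(logits.shape)  # (4, 50): one score per vocabulary entry, per position
```

The key invariant is that every transformer block maps `(seq, d_model)` to `(seq, d_model)`; only the embedding lookup and the final projection change the last dimension.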

CUDA Graphs and Inference Optimization

Understand how CUDA graphs eliminate kernel launch overhead and achieve 3-5x speedups in the decode phase.
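
As a back-of-envelope illustration of where the speedup comes from (all numbers below are invented for the sketch, not measurements): eager execution pays a launch overhead per kernel, while replaying a captured graph pays it roughly once per decode step.

```python
def decode_step_us(num_kernels, launch_us, kernel_us, use_graph):
    """Toy cost model: per-step time in microseconds.
    Eager mode launches every kernel; a captured CUDA graph
    is replayed with roughly one launch for the whole step."""
    launches = 1 if use_graph else num_kernels
    return launches * launch_us + num_kernels * kernel_us

eager = decode_step_us(100, 5.0, 2.0, use_graph=False)  # 700.0 us
graph = decode_step_us(100, 5.0, 2.0, use_graph=True)   # 205.0 us
print(eager / graph)  # ~3.4x, in the quoted 3-5x range
```

The model makes clear why the win is largest in decode: decode kernels are tiny (one token per sequence), so launch overhead dominates; in prefill the kernels are large and the same overhead is negligible.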

Scheduling and Continuous Batching

Discover how schedulers keep GPUs busy by interleaving prefill and decode work, managing memory, and handling preemption.
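
A minimal sketch of one continuous-batching iteration, under simplifying assumptions (a shared token budget, one decode token per running sequence, and admission of new prefills with leftover budget; field names are illustrative, not the actual scheduler's API):

```python
from collections import deque

def schedule_step(waiting, running, token_budget):
    """Build one batch: decode every running sequence, then admit
    new prefills into the leftover token budget."""
    batch = []
    # Each running sequence decodes exactly one token this step.
    for req in running:
        batch.append(("decode", req["id"]))
        token_budget -= 1
    # Continuous batching: new requests join mid-flight instead of
    # waiting for the whole batch to drain.
    while waiting and waiting[0]["prompt_len"] <= token_budget:
        req = waiting.popleft()
        token_budget -= req["prompt_len"]
        running.append(req)
        batch.append(("prefill", req["id"]))
    return batch

waiting = deque([{"id": "A", "prompt_len": 3}, {"id": "B", "prompt_len": 10}])
running = [{"id": "C", "prompt_len": 5}]
print(schedule_step(waiting, running, token_budget=8))
# -> [('decode', 'C'), ('prefill', 'A')]; B waits for a later step
```

Preemption (evicting a running sequence when memory runs out) and chunked prefill (splitting B's 10-token prompt across steps) extend this same loop.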

The Complete Inference Pipeline

Trace a complete inference request from arrival to completion, seeing how all components work together in practice.

awesome-nano-vllm

Production-grade vLLM implementation with chunked prefill, mixed-batch execution, continuous batching, and prefix caching.