The math behind how LLMs are trained and served – Reiner Pope
Reiner Pope, CEO of chip startup MatX and former Google TPU architect, delivers a blackboard lecture explaining the mathematics behind LLM training and inference. He covers roofline analysis, batch size economics, memory bandwidth constraints, mixture-of-experts architectures, parallelism strategies, and how these fundamentals explain API pricing, context length limits, and AI model scaling trends.
Summary
Reiner Pope walks through the core mathematics governing how large language models are trained and served, using a roofline analysis applied to a rack-scale Blackwell NVL72 system. He introduces two key quantities: the time to operate on the model weights and the time to operate on the KV cache (the context). The roofline model shows that inference time per step is bounded below by the larger of memory fetch time and compute time.
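The roofline bound reduces to a one-line calculation. A minimal sketch, using rough, assumed hardware numbers in the neighborhood of a Blackwell-class GPU (the exact specs are illustrative assumptions, not figures from the talk):

```python
# Roofline lower bound on per-step decode time: a step cannot finish faster than
# the slower of (a) streaming weights + KV cache from HBM and (b) doing the math.
# Hardware numbers below are rough, assumed values for a Blackwell-class GPU.
HBM_BW = 8e12          # bytes/s of HBM bandwidth (assumed)
PEAK_FLOPS = 4.5e15    # FLOP/s at low precision (assumed)

def step_time(weight_bytes, kv_bytes, flops):
    memory_time = (weight_bytes + kv_bytes) / HBM_BW   # time to stream weights + context
    compute_time = flops / PEAK_FLOPS                  # time to do the matmuls
    return max(memory_time, compute_time)              # roofline: bounded by the larger

# Example: 40B active parameters in FP8, 10 GB of KV cache, batch of one token.
print(step_time(weight_bytes=40e9, kv_bytes=10e9, flops=2 * 40e9))
```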
Pope explains that batch size is the dominant lever for trading off cost against latency. At small batch sizes the cost per token is extremely high because the weight fetch is not amortized across many users. As the batch grows, the weight fetch cost is spread across more tokens until compute becomes the bottleneck, which sets a lower bound on cost. He derives that the optimal batch size, where memory and compute time are balanced, scales inversely with the model's sparsity ratio (active parameters divided by total parameters) from a baseline of roughly 300 tokens, landing at roughly 2,000 tokens per batch for a model like DeepSeek V3.
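A rough sketch of that batch-size arithmetic, using the same assumed hardware numbers as above and DeepSeek-V3-like parameter counts purely for illustration; the talk's 300 and roughly 2,000 figures come out of this kind of calculation with Pope's own hardware and model assumptions, which may differ from the ones here:

```python
# Cost per token vs. batch size: the weight fetch is paid once per step regardless of
# batch, while compute grows linearly with batch. Numbers are assumed, not from the talk.
HBM_BW = 8e12          # bytes/s (assumed)
PEAK_FLOPS = 4.5e15    # FLOP/s (assumed)

def time_per_token(batch, total_params, active_params, bytes_per_param=1):
    weight_fetch = total_params * bytes_per_param / HBM_BW   # shared across the whole batch
    compute = batch * 2 * active_params / PEAK_FLOPS         # ~2 FLOPs per active param per token
    return max(weight_fetch, compute) / batch

# Critical batch size: where weight-fetch time equals compute time.
def critical_batch(total_params, active_params, bytes_per_param=1):
    return (total_params * bytes_per_param / HBM_BW) / (2 * active_params / PEAK_FLOPS)

print(critical_batch(70e9, 70e9))     # dense model: a few hundred tokens
print(critical_batch(670e9, 37e9))    # sparse MoE (DeepSeek-V3-like sizes): thousands of tokens
```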
The lecture then turns to mixture-of-experts (MoE) architectures. Pope explains the router-and-expert structure and how expert parallelism maps naturally onto a GPU rack, with different experts assigned to different GPUs. This creates an all-to-all communication pattern that fits well within a single rack's NVLink interconnect but becomes a bottleneck across rack boundaries, where the scale-out network offers roughly 8x less bandwidth. This physical constraint means one rack effectively bounds the size of an expert layer, which has been a major driver of demand for larger scale-up domains.
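A toy sketch of the all-to-all dispatch pattern expert parallelism requires; sizes and the one-expert-per-GPU mapping are illustrative assumptions, not DeepSeek's real configuration:

```python
import numpy as np

# Minimal sketch of expert-parallel routing (hypothetical sizes and mapping).
num_tokens, d_model = 16, 8
num_experts, top_k = 4, 2          # one expert per "GPU" in this toy setup
x = np.random.randn(num_tokens, d_model)

# Router: each token picks its top-k experts by score.
router_logits = x @ np.random.randn(d_model, num_experts)
chosen = np.argsort(-router_logits, axis=1)[:, :top_k]   # shape (tokens, top_k)

# All-to-all dispatch: group token indices by destination expert/GPU.
# Within one rack this exchange rides on NVLink; across racks it would hit the
# roughly 8x slower scale-out network, which is the bottleneck Pope describes.
dispatch = {e: np.where((chosen == e).any(axis=1))[0] for e in range(num_experts)}
for expert, token_ids in dispatch.items():
    print(f"expert {expert} (GPU {expert}) receives tokens {token_ids.tolist()}")
```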
Pope analyzes pipeline parallelism as a solution to memory capacity constraints, showing that pipelining across racks reduces the per-rack weight storage requirement but does not reduce the KV cache memory footprint—because the number of in-flight sequences must grow proportionally to keep all pipeline stages busy. He argues that for inference, expert parallelism within a rack is the dominant strategy, with minimal pipelining only to manage weight storage.
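A small sketch of that cancellation argument, with purely illustrative sizes (not figures from the talk):

```python
# Pipeline parallelism across racks: weights divide across stages, but keeping every
# stage busy requires roughly one micro-batch in flight per stage, so the per-rack
# KV-cache footprint does not shrink. All sizes below are illustrative assumptions.
def per_rack_memory_gb(stages, total_weight_gb, kv_gb_per_seq_whole_model, seqs_per_microbatch):
    weights = total_weight_gb / stages                # each rack holds 1/stages of the weights
    in_flight_seqs = stages * seqs_per_microbatch     # one micro-batch per stage to avoid bubbles
    # Each rack only stores KV for its own layers (1/stages of the model), but for
    # stages-times as many in-flight sequences: the two factors cancel exactly.
    kv = in_flight_seqs * kv_gb_per_seq_whole_model / stages
    return weights, kv

for stages in (1, 2, 4, 8):
    w, kv = per_rack_memory_gb(stages, total_weight_gb=700,
                               kv_gb_per_seq_whole_model=2, seqs_per_microbatch=64)
    print(f"{stages} stage(s): weights/rack = {w:6.1f} GB, KV/rack = {kv:.0f} GB")
```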
The discussion extends to how scale-up domain size affects memory bandwidth, which is the true limiting factor for long-context inference. Larger scale-up domains allow weight matrices to be loaded in parallel across more GPUs, dramatically improving throughput. This explains why Gemini, running on Google's TPUs with historically larger scale-up domains, appeared ahead in certain capabilities.
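The bandwidth argument is simple division across the scale-up domain: shard the weights over more GPUs and every HBM stack streams its shard in parallel. A sketch with an assumed per-GPU bandwidth:

```python
# Time to stream a model's weights once, as a function of scale-up domain size.
# Within the domain the weights shard across GPUs and all HBM stacks load in parallel.
PER_GPU_HBM_BW = 8e12   # bytes/s per GPU (assumed)

def weight_load_time(weight_bytes, gpus_in_scale_up_domain):
    aggregate_bw = gpus_in_scale_up_domain * PER_GPU_HBM_BW
    return weight_bytes / aggregate_bw

for n in (8, 72, 256):
    print(f"{n:4d} GPUs: {weight_load_time(670e9, n) * 1e6:.0f} us per weight pass")
```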
Pope uses these principles to reverse-engineer information from public API pricing: Gemini's 50% price increase beyond 200K tokens reveals approximately where compute and memory-bandwidth costs equalize; the 5x price difference between output and input tokens confirms that decode is heavily memory-bandwidth-bound while prefill is compute-bound; and the retention windows offered for cached KV (5 minutes vs. 1 hour) hint at which storage tier (flash vs. spinning disk) holds the cache, since each tier's drain time sets the retention it can economically support.
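The storage-tier inference rests on drain-time arithmetic: a tier's capacity divided by its bandwidth says how long it takes to cycle its contents, and a retention window far from that drain time would not make economic sense. A sketch with rough, assumed capacities and bandwidths (not figures from the talk):

```python
# Drain time = capacity / bandwidth for a few storage tiers (rough assumed specs).
# A KV-cache retention window priced at ~5 minutes or ~1 hour can be matched
# against these drain times to guess which tier holds the cache.
tiers = {
    "HBM":           (192e9, 8e12),    # ~192 GB at ~8 TB/s per GPU (assumed)
    "host DRAM":     (2e12,  4e11),    # ~2 TB at ~400 GB/s (assumed)
    "NVMe flash":    (30e12, 1e10),    # ~30 TB at ~10 GB/s (assumed)
    "spinning disk": (20e12, 2.5e8),   # ~20 TB at ~250 MB/s (assumed)
}
for name, (capacity_bytes, bw_bytes_per_s) in tiers.items():
    drain_s = capacity_bytes / bw_bytes_per_s
    print(f"{name:14s} drains in ~{drain_s:,.0f} s (~{drain_s / 60:.1f} min)")
```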
He also derives a first-principles estimate suggesting frontier models are trained on roughly 100x more tokens than Chinchilla-optimal, driven by the need to equalize training compute cost with inference compute cost across the model's deployment lifetime. Finally, Pope discusses the architectural convergence between cryptographic hash functions and neural networks—both require mixing information across inputs—and describes how the Feistel cipher construction was imported into neural networks as 'RevNets,' enabling fully invertible networks that eliminate the need to store activations during training at the cost of additional compute.
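A minimal sketch of the reversible, Feistel-style coupling that RevNets use; F and G below are toy stand-ins for the residual sub-networks, chosen only to show that the block inverts exactly:

```python
import numpy as np

# Feistel-style reversible block (the RevNet coupling): split the activation in two
# and update each half additively using a function of the other half. Because the
# updates are additive, the inputs can be reconstructed exactly from the outputs,
# so activations need not be stored for backprop (at the cost of recomputing F and G).
def F(z): return np.tanh(z)            # toy stand-in for a residual sub-network
def G(z): return np.maximum(z, 0.0)    # toy stand-in for a residual sub-network

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
assert np.allclose((x1, x2), inverse(*forward(x1, x2)))   # exact reconstruction
```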
Key Insights
- Pope derives that the optimal batch size for amortizing weight fetches against compute scales inversely with the model's sparsity ratio from a baseline of roughly 300 tokens, landing at roughly 2,000 tokens for DeepSeek V3, and argues that failing to batch users together can make inference economics up to 1,000 times worse than at the optimal batch size.
- Pope argues that one GPU rack physically bounds the size of a mixture-of-experts layer because the all-to-all communication pattern required by expert parallelism only works efficiently within a rack's NVLink fabric; crossing rack boundaries to a scale-out network that is 8x slower creates a hard bottleneck, which is what has been driving demand for ever-larger scale-up interconnect domains.
- Pope shows that pipeline parallelism reduces per-rack weight storage requirements but does not reduce KV cache memory footprint per GPU, because the number of in-flight micro-batches must equal the number of pipeline stages, exactly canceling the capacity saving—meaning pipelining is useful for weights but cannot solve the memory wall for long-context inference.
- Pope reverse-engineers from Gemini's public API pricing that the 50% price increase beyond 200K tokens corresponds to where KV cache memory bandwidth costs overtake compute costs, and that the 5x price difference between output and input tokens reveals that decode is heavily memory-bandwidth-bottlenecked while prefill is compute-bound.
- Pope derives a first-principles estimate that frontier models are trained on approximately 100x more tokens than Chinchilla-optimal, concluding from the heuristic that training compute cost, RL compute cost, and inference compute cost should be roughly equalized—implying that the total inference tokens served over a model's deployment lifetime should approximately equal the number of pre-training tokens.
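The compute-equalization heuristic in the last insight reduces to a few lines of arithmetic; the sketch below uses round, assumed numbers rather than anything stated in the talk:

```python
# Back-of-envelope for the "train ~100x past Chinchilla" estimate.
# All values are illustrative assumptions, not figures from the talk.
n_active = 40e9                          # active parameters of a hypothetical frontier MoE
chinchilla_tokens = 20 * n_active        # Chinchilla-optimal is roughly 20 tokens per parameter
train_tokens = 100 * chinchilla_tokens   # the ~100x overtraining Pope estimates

train_flops = 6 * n_active * train_tokens                      # ~6ND rule of thumb for training
lifetime_inference_tokens = train_tokens                       # heuristic: serve ~as many tokens as you train on
inference_flops = 2 * n_active * lifetime_inference_tokens     # ~2ND for forward-only serving

print(f"training tokens:  {train_tokens:.1e}")
print(f"training FLOPs:   {train_flops:.1e}")
print(f"inference FLOPs:  {inference_flops:.1e} (~{inference_flops / train_flops:.2f}x training)")
```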