The math behind how LLMs are trained and served – Reiner Pope
Reiner Pope, CEO of chip startup MatX and former Google TPU architect, delivers a blackboard lecture explaining the mathematics behind LLM training and inference. He covers roofline analysis, batch size economics, memory bandwidth constraints, mixture-of-experts architectures, parallelism strategies, and how these fundamentals explain API pricing, context length limits, and AI model scaling trends.
Summary
Reiner Pope walks through the core mathematics governing how large language models are trained and served, using a roofline analysis applied to a rack-scale Blackwell NVL72 system. He introduces two key quantities: the time to operate on the model weights and the time to operate on the KV cache (the context). The roofline model shows that inference time per step is bounded below by the larger of memory fetch time and compute time.
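The roofline bound reduces to a one-line calculation. A minimal sketch, using rough, assumed hardware numbers in the neighborhood of a Blackwell-class GPU (the exact specs are illustrative assumptions, not figures from the talk):

```python
# Roofline lower bound on per-step decode time: a step cannot finish faster than
# the slower of (a) streaming weights + KV cache from HBM and (b) doing the math.
# Hardware numbers below are rough, assumed values for a Blackwell-class GPU.
HBM_BW = 8e12          # bytes/s of HBM bandwidth (assumed)
PEAK_FLOPS = 4.5e15    # FLOP/s at low precision (assumed)

def step_time(weight_bytes, kv_bytes, flops):
    memory_time = (weight_bytes + kv_bytes) / HBM_BW   # time to stream weights + context
    compute_time = flops / PEAK_FLOPS                  # time to do the matmuls
    return max(memory_time, compute_time)              # roofline: bounded by the larger

# Example: 40B active parameters in FP8, 10 GB of KV cache, batch of one token.
print(step_time(weight_bytes=40e9, kv_bytes=10e9, flops=2 * 40e9))
```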
Pope explains that batch size is the dominant lever for trading off cost against latency. At small batch sizes the cost per token is extremely high because the weight fetch is not amortized across many users. As the batch grows, the weight fetch cost is spread across more tokens until compute becomes the bottleneck, which sets a lower bound on cost. He derives that the optimal batch size, where memory and compute time are balanced, scales inversely with the model's sparsity ratio (active parameters divided by total parameters) from a baseline of roughly 300 tokens, landing at roughly 2,000 tokens per batch for a model like DeepSeek V3.
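A rough sketch of that batch-size arithmetic, using the same assumed hardware numbers as above and DeepSeek-V3-like parameter counts purely for illustration; the talk's 300 and roughly 2,000 figures come out of this kind of calculation with Pope's own hardware and model assumptions, which may differ from the ones here:

```python
# Cost per token vs. batch size: the weight fetch is paid once per step regardless of
# batch, while compute grows linearly with batch. Numbers are assumed, not from the talk.
HBM_BW = 8e12          # bytes/s (assumed)
PEAK_FLOPS = 4.5e15    # FLOP/s (assumed)

def time_per_token(batch, total_params, active_params, bytes_per_param=1):
    weight_fetch = total_params * bytes_per_param / HBM_BW   # shared across the whole batch
    compute = batch * 2 * active_params / PEAK_FLOPS         # ~2 FLOPs per active param per token
    return max(weight_fetch, compute) / batch

# Critical batch size: where weight-fetch time equals compute time.
def critical_batch(total_params, active_params, bytes_per_param=1):
    return (total_params * bytes_per_param / HBM_BW) / (2 * active_params / PEAK_FLOPS)

print(critical_batch(70e9, 70e9))     # dense model: a few hundred tokens
print(critical_batch(670e9, 37e9))    # sparse MoE (DeepSeek-V3-like sizes): thousands of tokens
```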
The lecture then turns to mixture-of-experts (MoE) architectures. Pope explains the router-and-expert structure and how expert parallelism maps naturally onto a GPU rack, with different experts assigned to different GPUs. This creates an all-to-all communication pattern that fits well within a single rack's NVLink interconnect but becomes a bottleneck across rack boundaries, where the scale-out network offers roughly 8x less bandwidth. This physical constraint means one rack effectively bounds the size of an expert layer, which has been a major driver of demand for larger scale-up domains.
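A toy sketch of the all-to-all dispatch pattern expert parallelism requires; sizes and the one-expert-per-GPU mapping are illustrative assumptions, not DeepSeek's real configuration:

```python
import numpy as np

# Minimal sketch of expert-parallel routing (hypothetical sizes and mapping).
num_tokens, d_model = 16, 8
num_experts, top_k = 4, 2          # one expert per "GPU" in this toy setup
x = np.random.randn(num_tokens, d_model)

# Router: each token picks its top-k experts by score.
router_logits = x @ np.random.randn(d_model, num_experts)
chosen = np.argsort(-router_logits, axis=1)[:, :top_k]   # shape (tokens, top_k)

# All-to-all dispatch: group token indices by destination expert/GPU.
# Within one rack this exchange rides on NVLink; across racks it would hit the
# roughly 8x slower scale-out network, which is the bottleneck Pope describes.
dispatch = {e: np.where((chosen == e).any(axis=1))[0] for e in range(num_experts)}
for expert, token_ids in dispatch.items():
    print(f"expert {expert} (GPU {expert}) receives tokens {token_ids.tolist()}")
```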
Pope analyzes pipeline parallelism as a solution to memory capacity constraints, showing that pipelining across racks reduces the per-rack weight storage requirement but does not reduce the KV cache memory footprint—because the number of in-flight sequences must grow proportionally to keep all pipeline stages busy. He argues that for inference, expert parallelism within a rack is the dominant strategy, with minimal pipelining only to manage weight storage.
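A small sketch of that cancellation argument, with purely illustrative sizes (not figures from the talk):

```python
# Pipeline parallelism across racks: weights divide across stages, but keeping every
# stage busy requires roughly one micro-batch in flight per stage, so the per-rack
# KV-cache footprint does not shrink. All sizes below are illustrative assumptions.
def per_rack_memory_gb(stages, total_weight_gb, kv_gb_per_seq_whole_model, seqs_per_microbatch):
    weights = total_weight_gb / stages                # each rack holds 1/stages of the weights
    in_flight_seqs = stages * seqs_per_microbatch     # one micro-batch per stage to avoid bubbles
    # Each rack only stores KV for its own layers (1/stages of the model), but for
    # stages-times as many in-flight sequences: the two factors cancel exactly.
    kv = in_flight_seqs * kv_gb_per_seq_whole_model / stages
    return weights, kv

for stages in (1, 2, 4, 8):
    w, kv = per_rack_memory_gb(stages, total_weight_gb=700,
                               kv_gb_per_seq_whole_model=2, seqs_per_microbatch=64)
    print(f"{stages} stage(s): weights/rack = {w:6.1f} GB, KV/rack = {kv:.0f} GB")
```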
The discussion extends to how scale-up domain size affects memory bandwidth, which is the true limiting factor for long-context inference. Larger scale-up domains allow weight matrices to be loaded in parallel across more GPUs, dramatically improving throughput. This explains why Gemini, running on Google's TPUs with historically larger scale-up domains, appeared ahead in certain capabilities.
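The bandwidth argument is simple division across the scale-up domain: shard the weights over more GPUs and every HBM stack streams its shard in parallel. A sketch with an assumed per-GPU bandwidth:

```python
# Time to stream a model's weights once, as a function of scale-up domain size.
# Within the domain the weights shard across GPUs and all HBM stacks load in parallel.
PER_GPU_HBM_BW = 8e12   # bytes/s per GPU (assumed)

def weight_load_time(weight_bytes, gpus_in_scale_up_domain):
    aggregate_bw = gpus_in_scale_up_domain * PER_GPU_HBM_BW
    return weight_bytes / aggregate_bw

for n in (8, 72, 256):
    print(f"{n:4d} GPUs: {weight_load_time(670e9, n) * 1e6:.0f} us per weight pass")
```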
Pope uses these principles to reverse-engineer information from public API pricing: Gemini's 50% price increase beyond 200K tokens reveals approximately where compute and memory-bandwidth costs equalize; the 5x price difference between output and input tokens confirms that decode is heavily memory-bandwidth-bound while prefill is compute-bound; and the retention windows offered for cached KV (5 minutes vs. 1 hour) hint at which storage tier (flash vs. spinning disk) holds the cache, since each tier's drain time sets the retention it can economically support.
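The storage-tier inference rests on drain-time arithmetic: a tier's capacity divided by its bandwidth says how long it takes to cycle its contents, and a retention window far from that drain time would not make economic sense. A sketch with rough, assumed capacities and bandwidths (not figures from the talk):

```python
# Drain time = capacity / bandwidth for a few storage tiers (rough assumed specs).
# A KV-cache retention window priced at ~5 minutes or ~1 hour can be matched
# against these drain times to guess which tier holds the cache.
tiers = {
    "HBM":           (192e9, 8e12),    # ~192 GB at ~8 TB/s per GPU (assumed)
    "host DRAM":     (2e12,  4e11),    # ~2 TB at ~400 GB/s (assumed)
    "NVMe flash":    (30e12, 1e10),    # ~30 TB at ~10 GB/s (assumed)
    "spinning disk": (20e12, 2.5e8),   # ~20 TB at ~250 MB/s (assumed)
}
for name, (capacity_bytes, bw_bytes_per_s) in tiers.items():
    drain_s = capacity_bytes / bw_bytes_per_s
    print(f"{name:14s} drains in ~{drain_s:,.0f} s (~{drain_s / 60:.1f} min)")
```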
He also derives a first-principles estimate suggesting frontier models are trained on roughly 100x more tokens than Chinchilla-optimal, driven by the need to equalize training compute cost with inference compute cost across the model's deployment lifetime. Finally, Pope discusses the architectural convergence between cryptographic hash functions and neural networks—both require mixing information across inputs—and describes how the Feistel cipher construction was imported into neural networks as 'RevNets,' enabling fully invertible networks that eliminate the need to store activations during training at the cost of additional compute.
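A minimal sketch of the reversible, Feistel-style coupling that RevNets use; F and G below are toy stand-ins for the residual sub-networks, chosen only to show that the block inverts exactly:

```python
import numpy as np

# Feistel-style reversible block (the RevNet coupling): split the activation in two
# and update each half additively using a function of the other half. Because the
# updates are additive, the inputs can be reconstructed exactly from the outputs,
# so activations need not be stored for backprop (at the cost of recomputing F and G).
def F(z): return np.tanh(z)            # toy stand-in for a residual sub-network
def G(z): return np.maximum(z, 0.0)    # toy stand-in for a residual sub-network

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
assert np.allclose((x1, x2), inverse(*forward(x1, x2)))   # exact reconstruction
```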
Key Insights
- Pope derives that the optimal batch size for amortizing weight fetches against compute scales inversely with the model's sparsity ratio from a baseline of roughly 300 tokens, landing at roughly 2,000 tokens for DeepSeek V3, and argues that failing to batch users together can make inference economics up to 1,000 times worse than at the optimal batch size.
- Pope argues that one GPU rack physically bounds the size of a mixture-of-experts layer because the all-to-all communication pattern required by expert parallelism only works efficiently within a rack's NVLink fabric; crossing rack boundaries to a scale-out network that is 8x slower creates a hard bottleneck, which is what has been driving demand for ever-larger scale-up interconnect domains.
- Pope shows that pipeline parallelism reduces per-rack weight storage requirements but does not reduce KV cache memory footprint per GPU, because the number of in-flight micro-batches must equal the number of pipeline stages, exactly canceling the capacity saving—meaning pipelining is useful for weights but cannot solve the memory wall for long-context inference.
- Pope reverse-engineers from Gemini's public API pricing that the 50% price increase beyond 200K tokens corresponds to where KV cache memory bandwidth costs overtake compute costs, and that the 5x price difference between output and input tokens reveals that decode is heavily memory-bandwidth-bottlenecked while prefill is compute-bound.
- Pope derives a first-principles estimate that frontier models are trained on approximately 100x more tokens than Chinchilla-optimal, concluding from the heuristic that training compute cost, RL compute cost, and inference compute cost should be roughly equalized—implying that the total inference tokens served over a model's deployment lifetime should approximately equal the number of pre-training tokens.
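The compute-equalization heuristic in the last insight reduces to a few lines of arithmetic; the sketch below uses round, assumed numbers rather than anything stated in the talk:

```python
# Back-of-envelope for the "train ~100x past Chinchilla" estimate.
# All values are illustrative assumptions, not figures from the talk.
n_active = 40e9                          # active parameters of a hypothetical frontier MoE
chinchilla_tokens = 20 * n_active        # Chinchilla-optimal is roughly 20 tokens per parameter
train_tokens = 100 * chinchilla_tokens   # the ~100x overtraining Pope estimates

train_flops = 6 * n_active * train_tokens                      # ~6ND rule of thumb for training
lifetime_inference_tokens = train_tokens                       # heuristic: serve ~as many tokens as you train on
inference_flops = 2 * n_active * lifetime_inference_tokens     # ~2ND for forward-only serving

print(f"training tokens:  {train_tokens:.1e}")
print(f"training FLOPs:   {train_flops:.1e}")
print(f"inference FLOPs:  {inference_flops:.1e} (~{inference_flops / train_flops:.2f}x training)")
```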