
Shocking performance boost of assembly code: ~100x faster than C code | Lex Fridman Podcast

Lex Clips

Developers from the FFmpeg/VLC ecosystem explain why handwritten SIMD assembly can outperform C by 10-100x, using the AV1 decoder dav1d (240,000 lines of handwritten assembly) as the prime example. They argue that as Moore's Law slows and hardware gains plateau, low-level optimization becomes increasingly critical. The conversation challenges the widespread assumption that modern compilers with auto-vectorization can match hand-crafted assembly.

Summary

The conversation opens with a demonstration that handwritten assembly code achieved a 62x speed improvement over equivalent C code, challenging the common industry assumption that C is 'fast enough.' The speakers, who have built companies around the FFmpeg/VLC ethos, explain that most companies accept C as sufficiently performant, but significant gains remain on the table.

They discuss the extraordinary OS compatibility maintained by VLC and FFmpeg, which support everything from Windows XP to Windows 11, Mac OS X 10.7 to the latest macOS versions, iOS 9, and even OS/2. This is achieved by a very small team with far fewer resources than Microsoft, Google, or Apple. Supporting legacy platforms like iOS 9 requires creative 'Frankenstein' builds that combine multiple Xcode versions.

The technical core of the discussion is SIMD (Single Instruction, Multiple Data) assembly, in which a single instruction operates on a whole vector of values at once (16 in the speakers' example) rather than on one value at a time, which suits video processing on pixel grids. The speakers argue that handwritten SIMD consistently outperforms both compiler-generated code and intrinsics by orders of magnitude, not merely a few percent, despite ongoing claims from the software community that modern auto-vectorization closes the gap.
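To make the "16 values per instruction" idea concrete, here is a minimal sketch in C. It is not code from the decoder or from FFmpeg, and it is not handwritten assembly: it assumes a GCC or Clang toolchain and uses their `vector_size` extension so that one expression works on 16 bytes at a time. The function names `avg_scalar` and `avg_simd` are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16 unsigned bytes packed into one 128-bit vector.
 * Assumption: GCC/Clang vector extension, not standard C. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

/* Scalar reference: rounded-down average of two pixel rows,
 * one pixel per loop iteration. */
void avg_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i]) >> 1);
}

/* Vector version: each loop iteration averages 16 pixels. */
void avg_simd(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    const v16u8 one = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        v16u8 va, vb;
        /* memcpy avoids alignment assumptions on the input pointers. */
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);

        /* floor((a + b) / 2) without overflowing 8-bit lanes:
         * a + b == 2 * (a & b) + (a ^ b). */
        v16u8 vr = (va & vb) + ((va ^ vb) >> one);

        memcpy(dst + i, &vr, sizeof vr);
    }

    /* Leftover pixels (fewer than 16) go through the scalar path. */
    avg_scalar(dst + i, a + i, b + i, n - i);
}
```

Both functions produce identical output; the difference is that the vector loop handles 16 pixels per iteration. Handwritten assembly of the kind the speakers describe goes further by choosing the exact instructions and register allocation, which this sketch leaves to the compiler.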

The flagship example is dav1d, an open-source AV1 software decoder that is 79.9% handwritten assembly (240,000 lines) versus only 30,000 lines of C, described as likely one of the largest assembly codebases ever. dav1d was created because the Alliance for Open Media (Google, Netflix, Amazon, Mozilla) claimed AV1 was too complex for software decoding, yet dav1d achieves 720p decoding on just one or two CPU cores. It runs on roughly 3 billion devices, which matters given that 30% of Netflix traffic and 50% of YouTube traffic is now AV1.

The speakers describe extreme low-level optimizations in dav1d: defining custom calling conventions that bypass the standard OS conventions to avoid unnecessary register saves and loads, using cryptography instructions for unrelated video processing tasks, and detecting CPU features at runtime to set the appropriate function pointers for each architecture and instruction-set extension (x86 AVX-512, ARM64 with SVE and SME, RISC-V, etc.).
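The runtime dispatch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the decoder's actual code: `dsp_context`, `dsp_init`, `detect_cpu_flags`, `sad_c`, and `sad_wide` are hypothetical names, and `sad_wide` is ordinary C standing in where a handwritten assembly routine would normally be plugged in. Feature detection uses the GCC/Clang builtin `__builtin_cpu_supports`, guarded so that it is only used on x86 targets; elsewhere the sketch simply falls back to the plain C path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature flags for this sketch. */
enum { CPU_FLAG_SSE2 = 1 << 0, CPU_FLAG_AVX2 = 1 << 1 };

/* Sum of absolute differences between two pixel buffers. */
typedef uint32_t (*sad_fn)(const uint8_t *a, const uint8_t *b, size_t n);

/* Table of function pointers filled in once at startup. */
typedef struct {
    sad_fn sad;
} dsp_context;

static inline uint32_t abs_diff(uint8_t x, uint8_t y)
{
    return x > y ? (uint32_t)(x - y) : (uint32_t)(y - x);
}

/* Portable baseline implementation. */
uint32_t sad_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += abs_diff(a[i], b[i]);
    return sum;
}

/* Stand-in for an optimized routine (in a real project: assembly). */
uint32_t sad_wide(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += abs_diff(a[i], b[i]);
        s1 += abs_diff(a[i + 1], b[i + 1]);
        s2 += abs_diff(a[i + 2], b[i + 2]);
        s3 += abs_diff(a[i + 3], b[i + 3]);
    }
    for (; i < n; i++)
        s0 += abs_diff(a[i], b[i]);
    return s0 + s1 + s2 + s3;
}

/* Query the CPU once at runtime. Returns 0 on non-x86 targets. */
unsigned detect_cpu_flags(void)
{
    unsigned flags = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) flags |= CPU_FLAG_SSE2;
    if (__builtin_cpu_supports("avx2")) flags |= CPU_FLAG_AVX2;
#endif
    return flags;
}

/* Start from the C baseline, then override with the best match. */
void dsp_init(dsp_context *dsp, unsigned flags)
{
    assert(dsp != NULL);
    dsp->sad = sad_c;
    if (flags & CPU_FLAG_AVX2)
        dsp->sad = sad_wide;
}
```

A caller would run `dsp_init(&dsp, detect_cpu_flags())` once at startup and then call `dsp.sad(...)` in the hot loop, so the feature check is paid once rather than on every call.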

Philosophically, the speakers argue that the end of Moore's Law means hardware speed is no longer advancing rapidly enough to compensate for inefficient software. The value of low-level optimization will grow as AI inference, real-time processing, and cost constraints push developers back down the stack. They draw a parallel to LLM quantization (FP8, FP4, 1-bit weights) as another domain where hardware constraints force deep optimization. The conversation concludes with the assertion that vibe coding and AI-assisted programming will handle business logic, but hardware-level optimization remains something that cannot be automated away.

Key Insights

  • A handwritten SIMD assembly function achieved a 62x speed improvement over equivalent C code, which the speakers cite as a concrete demonstration that C is not inherently fast and that compiler auto-vectorization is 'not even close' — not 5-10% slower but multiple times slower.
  • The AV1 decoder dav1d contains 240,000 lines of handwritten assembly versus only 30,000 lines of C, making it likely one of the largest handwritten assembly codebases in existence. It was built because even the Alliance for Open Media (Google, Netflix, Amazon) claimed AV1 was too complex for software-only decoding.
  • dav1d deliberately violates standard OS calling conventions to avoid unnecessary register saves and loads to L1/L2 cache, defining its own internal calling convention instead. The speakers say they have never seen this technique in any other mass-deployed project running on billions of devices.
  • The speakers argue that the end of Moore's Law fundamentally changes the calculus of software development: because hardware is no longer getting dramatically faster and adding more cores has limits, the value of low-level assembly optimization will increase as CPU, RAM, and networking constraints become binding.
  • VLC and FFmpeg are maintained by a very small team, yet the speakers say they support more operating systems than Microsoft, Google, or Apple combined. That includes OS/2, which has roughly 10 users worldwide, one of them a maintainer. This illustrates the ethos that optimization lets old hardware remain useful rather than forcing unnecessary upgrades.

Topics

  • Handwritten SIMD assembly vs. C performance
  • dav1d AV1 decoder architecture
  • VLC/FFmpeg cross-platform OS support
  • End of Moore's Law and the case for low-level optimization
  • Custom calling conventions and CPU architecture abuse
