Inference, Diffusion, World Models, and More | YC Paper Club
The inaugural YC Paper Club at Pioneer featured five presentations on cutting-edge AI research topics including speculative decoding for faster LLM inference, diffusion model predictive control for robotics, world models, generalization theory, and data-efficient pre-training. The event brought together researchers and founders in the Bay Area to build a community bridging academic research and startup development. Each presenter shared novel algorithms and findings aimed at advancing AI capabilities, efficiency, and theoretical understanding.
Summary
The inaugural YC Paper Club was held at Pioneer (YC's Woodside facility) with roughly 100 selected attendees drawn from over 1,000 applicants, bringing together researchers and founders from the Bay Area AI community. The host framed the event as a way to revive the special collaborative energy of the space, recalling the early days of OpenAI being founded there during his own YC batch (Winter 2016).
The first paper, presented by Tanishk (Stanford), introduced 'Speculative Speculative Decoding' (SSD), an extension of standard speculative decoding for LLM inference. Standard speculative decoding uses a small draft model to propose several candidate tokens that a larger target model then verifies in a single parallel pass, exploiting the fact that checking a batch of proposed tokens is cheaper than generating them one at a time with the target model. SSD goes further by parallelizing the normally sequential draft-verify loop: while the verifier is still running, the draft model anticipates likely verification outcomes and pre-generates the next round of drafts, hiding drafting latency almost entirely. The result is dramatically faster inference, demonstrated at ~300 tokens/second for Llama 3 70B on 4x H100s. The presenter argues that this turns inference speed from a cost or convenience lever into a capability lever as test-time compute scaling becomes more important.
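The standard draft-verify loop described above can be sketched as follows. This is a minimal greedy-decoding illustration, not the presenter's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for real models, and the overlap of drafting with verification that SSD adds is not modeled here.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def target_next(prefix: List[int]) -> int:
    """Hypothetical toy stand-in for the large target model."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_next(prefix: List[int]) -> int:
    """Hypothetical toy draft model: agrees with the target at most positions."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def greedy_decode(prompt: List[int], target: Model, n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens

def speculative_decode(prompt: List[int], target: Model, draft: Model,
                       n_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) Draft k tokens sequentially with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2) Verify. A real system scores all k positions in one batched
        #    target forward pass; this loop only mimics the accept/reject logic.
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # target's correction ends the round
                break
        else:
            accepted.append(target(tokens + proposal))  # bonus token if all pass
        tokens.extend(accepted)
    return tokens[:goal]

spec_out = speculative_decode([1, 2, 3], target_next, draft_next, n_new=20)
base_out = greedy_decode([1, 2, 3], target_next, n_new=20)
```

With greedy verification, the speculative output matches plain target decoding token for token; the speedup comes only from how many drafted tokens are accepted per verification pass.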
The second paper was presented by Stannis (Google DeepMind), covering 'Diffusion Model Predictive Control' (DMPC). This work applies diffusion models to both the action proposal and dynamics modeling components of model predictive control (MPC) for robotics. By using multi-step diffusion-based action proposals and dynamics models, DMPC reduces compounding errors and simplifies the planning algorithm to a basic sampling-based planner while remaining competitive with prior state-of-the-art. Key advantages include runtime adaptation to novel reward functions and novel dynamics (e.g., a robot with a broken joint), enabled by the factorized representation of action proposals and dynamics.
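A basic sampling-based planner of the kind described above can be sketched as follows. This is a toy under stated assumptions: `propose_actions` and `predict_states` are hypothetical stand-ins (uniform random actions and a 1-D integrator) for the learned diffusion action proposal and dynamics model, and the function names are not from the paper.

```python
import random
from typing import Callable, List

def propose_actions(state: float, horizon: int, rng: random.Random) -> List[float]:
    """Stand-in for the learned multi-step action proposal."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def predict_states(state: float, actions: List[float]) -> List[float]:
    """Stand-in for the learned multi-step dynamics model (1-D integrator)."""
    states = []
    for a in actions:
        state = state + a
        states.append(state)
    return states

def plan(state: float, reward_fn: Callable[[float], float], rng: random.Random,
         n_samples: int = 128, horizon: int = 5) -> float:
    """Sample action sequences, score imagined rollouts, return the best first action."""
    best_actions, best_return = None, float("-inf")
    for _ in range(n_samples):
        actions = propose_actions(state, horizon, rng)
        ret = sum(reward_fn(s) for s in predict_states(state, actions))
        if ret > best_return:
            best_actions, best_return = actions, ret
    return best_actions[0]

def run(reward_fn: Callable[[float], float], steps: int = 40, seed: int = 0) -> float:
    """Receding-horizon control: replan every step, execute only the first action."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        state = predict_states(state, [plan(state, reward_fn, rng)])[0]
    return state

def reward_near(goal: float) -> Callable[[float], float]:
    return lambda s: -abs(s - goal)

# The reward is swapped at run time; proposal and dynamics are untouched.
final_a = run(reward_near(3.0))
final_b = run(reward_near(-2.0))
```

Because the reward only enters at scoring time, changing the goal needs no retraining of either component, which is the kind of runtime adaptation the factorized design is meant to allow.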
The third paper, presented by Isaac Ward, covered 'LeWorldModel' from Yann LeCun's group at Meta. World models learn to predict how a system's state changes in response to actions, enabling model-based control, imagined rollouts, and uncertainty quantification. The paper's core contribution is the 'SIGReg' regularizer, a term that pushes each batch of embeddings toward an isotropic Gaussian distribution in latent space. It serves as an elegant way to prevent representational collapse during joint representation and dynamics learning, replacing the more complex tricks used by prior methods. The model runs on a single GPU with only 15M parameters and is ~50x faster than competing approaches, and spikes in its prediction error can be used to detect out-of-distribution perturbations.
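To make the idea of an isotropic-Gaussian regularizer concrete, here is a deliberately simplified sketch. It is not the paper's regularizer: it only matches the first two moments of a batch (zero mean, identity covariance), which is an assumption chosen for brevity, and the function name is hypothetical.

```python
import random
from typing import List

def isotropic_gaussian_penalty(batch: List[List[float]]) -> float:
    """Squared distance of the batch mean from 0 plus squared Frobenius
    distance of the batch covariance from the identity matrix."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in batch]
    penalty = sum(m * m for m in mean)
    for i in range(d):
        for j in range(d):
            cov_ij = sum(r[i] * r[j] for r in centered) / n
            target = 1.0 if i == j else 0.0
            penalty += (cov_ij - target) ** 2
    return penalty

rng = random.Random(0)
dim, batch_size = 4, 2000

# Healthy embeddings: roughly standard normal in every direction.
gaussian_batch = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]

# Collapsed embeddings: every input maps to the same point, so the
# covariance is zero and the penalty is ||mean||^2 + dim.
collapsed_batch = [[0.5] * dim for _ in range(batch_size)]

healthy_penalty = isotropic_gaussian_penalty(gaussian_batch)
collapsed_penalty = isotropic_gaussian_penalty(collapsed_batch)
```

A collapsed encoder has no variance to spread across latent directions, so it cannot drive this kind of penalty to zero; that is the intuition for why such a term discourages collapse.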
The fourth paper, presented by Ashe from QCabs, summarized Andrew Gordon Wilson's work 'Deep Learning is Not So Mysterious or Different,' which argues that classical generalization theories—specifically PAC-Bayes bounds and the concept of soft inductive biases—can explain phenomena like overparameterization, benign overfitting, and double descent that are often labeled as mysteries. The key insight is that larger models tend to find flatter, more compressible minima, lowering both the empirical risk and the compression term in PAC-Bayes bounds simultaneously, thus improving generalization.
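For reference, one common form of the bound behind this argument is the countable-hypothesis (Occam-style) version below. The exact statement used in the talk is not given in this summary, so the notation is an assumption. For a loss bounded in [0, 1], a prior P fixed before seeing the n training examples, and any δ in (0, 1), with probability at least 1 − δ the following holds simultaneously for all hypotheses h:

```latex
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2n}}
```

Here R(h) is the expected risk, R̂(h) is the empirical risk, and log(1/P(h)) plays the role of the compression term: a hypothesis the prior treats as simple (one with a short description) pays a smaller penalty. Read this way, a larger model tightens the bound when it lowers R̂(h) while also landing on a solution with smaller log(1/P(h)).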
The fifth and final paper was presented by Kuu (with co-authors including Percy Liang), addressing data-efficient pre-training in the regime where compute is unconstrained but data is fixed. Using only 200M tokens, the paper shows that aggressive regularization (30x higher weight decay than standard), ensembling, and distillation each lower the 'compute asymptote', that is, the loss a recipe approaches as compute grows without bound. Combining regularization and ensembling into a 'joint scaling recipe' yields roughly a 5x data efficiency win over standard pre-training, and distillation can compress the ensemble's benefits into a single small model while retaining ~83% of the improvement. These findings carry over to continued pre-training (CPT), achieving a ~17x data efficiency win on math-domain CPT.
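The ensembling and distillation steps can be illustrated on a single next-token prediction. This is a sketch under assumptions: the member logits are made-up numbers, and averaging member probabilities is one common way to form an ensemble prediction, not necessarily the paper's exact choice.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ensemble_probs(member_logits: List[List[float]]) -> List[float]:
    """Average the members' predicted distributions."""
    probs = [softmax(l) for l in member_logits]
    k = len(probs)
    return [sum(p[i] for p in probs) / k for i in range(len(probs[0]))]

def distillation_loss(teacher_probs: List[float], student_logits: List[float]) -> float:
    """Cross-entropy of the student against the teacher's soft targets."""
    q = softmax(student_logits)
    return -sum(t * math.log(qi) for t, qi in zip(teacher_probs, q))

# Hypothetical logits from three independently trained members (vocabulary of 3).
members = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, 0.0]]
true_token = 0

teacher = ensemble_probs(members)
ensemble_nll = -math.log(teacher[true_token])
mean_member_nll = sum(-math.log(softmax(l)[true_token]) for l in members) / len(members)

# A student that reproduces the teacher exactly minimizes the distillation loss.
matched_student = [math.log(p) for p in teacher]
uniform_student = [0.0, 0.0, 0.0]
```

By Jensen's inequality, the averaged prediction's negative log-likelihood is never worse than the members' average, which is one way to see why an ensemble can beat its individual models; distillation then trains a single student to match the averaged distribution.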
Key Insights
- Tanishk argues that inference speed should be understood as a capability lever, not merely a cost or convenience factor: for systems whose performance scales with thinking time, tokens per second directly bounds how much reasoning, and therefore how much intelligence, can be delivered within a given time budget.
- Stannis demonstrates that DMPC's factorized representation of action proposals and dynamics models allows runtime adaptation to novel dynamics (e.g., a broken robot joint) by simply fine-tuning only the dynamics model on new play data, without retraining the action proposal.
- Isaac Ward argues that LeWorldModel's SIGReg regularizer, which enforces an isotropic Gaussian distribution in the latent space, provides an elegant single-hyperparameter solution to representational collapse, replacing the complex and varied tricks used by prior world model methods.
- Ashe explains that Andrew Gordon Wilson's work shows overparameterization improves both terms of the PAC-Bayes bound simultaneously: larger models achieve lower training loss and also find flatter, more compressible solutions, resolving the apparent mystery of why scaling improves generalization.
- Kuu shows that self-distillation—taking a trained model and distilling it into a fresh model of the same size—surprisingly improves validation loss beyond the regularization asymptote, with a theoretical connection to implicitly training a two-member ensemble.