5 Papers That Show Where AI Research Is Heading Right Now

Y Combinator

A research meetup covering five AI topics: protein language model scaling laws (ESM Cambrian), self-play for LLMs (SGS algorithm), streaming RAG for voice agents, formal verification with Lean, and agentic software engineering workflows. Presenters demonstrate how foundational AI scaling principles are transferring into biology, mathematics, and production engineering.

Summary

The event opens with host François introducing themes in current AI research. He is skeptical that training on human-generated data (a subspace H of the full solution space F), even with added test-time compute, can reach all of F. He argues instead that AlphaZero-style self-play, unbiased by human demonstrations, is the more probable path to AGI. He also frames two open problems: intelligence per sample (optimal continual learning as data streams in) and intelligence per watt (smaller, more efficient models).

Yas Beg presents 'The Bitter Lesson Comes to Biology,' covering a recent paper from Biohub on the ESM Cambrian (ESMC) protein language model. The core finding is that scaling laws observed in language models do transfer to protein biology: log-linear improvement in structural understanding with compute is achievable, but only after pushing training data from 50 million to 2.8 billion sequences by incorporating metagenomic data. The prior ESM2 models had plateaued, and the fix was data scale rather than architectural cleverness. ESMC also approaches or beats AlphaFold 3 on antibody design tasks without requiring handcrafted multiple sequence alignments (MSAs), partially validating the bitter lesson in biology. Mechanistic interpretability analysis using sparse autoencoders reveals the model's latent space spontaneously organizes into a hierarchy of biologically meaningful features—from individual amino acids up to protein functional roles—without any supervised signal.
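As a rough illustration of what "log-linear improvement with compute" means, the sketch below fits a line to a metric against log10 of compute using ordinary least squares. The function name `fit_log_linear` and all data points are hypothetical and chosen for illustration only; they are not taken from the ESMC paper.

```python
import math

def fit_log_linear(compute, metric):
    """Least-squares fit of metric ~ a + b * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(metric) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, metric))
    b = sxy / sxx            # gain in the metric per 10x increase in compute
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical (compute, score) pairs, NOT results from the paper.
compute = [1e18, 1e19, 1e20, 1e21]
score = [0.50, 0.55, 0.60, 0.65]

a, b = fit_log_linear(compute, score)
print(f"gain per 10x compute: {b:.3f}")
```

Under this reading, a plateau like the one described for ESM2 would show up as points falling below the fitted line at the high-compute end, while a restored scaling law keeps them on it.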

Luke Bailey presents 'Scaling Self-Play with Self-Guidance' (SGS), a paper on asymmetric self-play for LLMs applied to formal math proving in Lean. He explains that standard self-play, which rewards the conjecturer based on the solver's solve rate, fails in practice because the conjecturer learns to generate artificially complex, inelegant problems that are hard but useless for learning. SGS fixes this by grounding synthetic problem generation near unsolved target problems and introducing a 'guide' model that scores whether generated problems are genuinely related to the targets rather than merely obfuscated. Results show a 7B model trained with SGS reaches the pass@k performance of a 670B model on formal math benchmarks, though the method still plateaus well below a 100% solve rate.
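The failure mode and the fix can be sketched with a toy reward function. Everything below is an assumption made for illustration: the difficulty reward `4 * p * (1 - p)` (which peaks at a 50% solve rate), the multiplicative use of a guide score in [0, 1], and the function names are hypothetical and are not the formulation used in the SGS paper.

```python
def solve_rate(attempts):
    """Fraction of solver attempts (booleans) that produced a valid proof."""
    return sum(attempts) / len(attempts)

def vanilla_reward(attempts):
    """Assumed difficulty-only reward: peaks when the solver succeeds half the time."""
    p = solve_rate(attempts)
    return 4 * p * (1 - p)

def guided_reward(attempts, guide_score):
    """Assumed guided reward: difficulty scaled by a guide model's score in [0, 1].

    guide_score stands in for a guide model's judgment of whether the
    problem is related to a target problem and not merely obfuscated.
    """
    return vanilla_reward(attempts) * guide_score

# Hypothetical outcomes for two generated problems.
obfuscated = [True, False, True, False]  # hard, but only because it is convoluted
related = [True, True, False, True]      # easier, but close to a target problem

print(vanilla_reward(obfuscated), vanilla_reward(related))
print(guided_reward(obfuscated, guide_score=0.1),
      guided_reward(related, guide_score=0.9))
```

With the difficulty-only reward the obfuscated problem wins; once the guide score is factored in, the related problem is preferred, which is the behavior the talk attributes to SGS.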

Arnab Matei presents a Meta paper on streaming RAG for voice AI agents, highlighting that standard RAG pipelines add unacceptable latency for conversational voice interfaces. The paper explores two approaches: fixed-interval streaming RAG (running retrieval on audio chunks as they arrive, using early retrieval signal consistency to decide when to commit) and a fine-tuned model that learns to trigger retrieval only when a partial query contains sufficient new information. The key research framing is identifying the right decision boundary for when a partial spoken query is 'good enough' to retrieve on, reducing latency by 0.5–1.5 seconds while maintaining accuracy parity.
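A minimal sketch of the fixed-interval idea: retrieve on the growing partial transcript after every chunk, and commit once the top-k results stop changing. The word-overlap retriever, the `stable_rounds` threshold, the corpus, and all names are hypothetical stand-ins; the paper's actual retriever and commit rule are not described in this summary.

```python
def retrieve(partial_query, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the partial query."""
    words = set(partial_query.lower().split())
    scores = {doc_id: len(words & set(text.lower().split()))
              for doc_id, text in corpus.items()}
    ranked = sorted((d for d in scores if scores[d] > 0),
                    key=lambda d: -scores[d])
    return ranked[:k]

def stream_retrieve(chunks, corpus, k=2, stable_rounds=2):
    """Commit once the same non-empty top-k appears stable_rounds times in a row.

    Returns (committed results, number of chunks consumed).
    """
    partial, previous, streak = "", None, 0
    for i, chunk in enumerate(chunks, start=1):
        partial = (partial + " " + chunk).strip()
        current = retrieve(partial, corpus, k)
        if current and current == previous:
            streak += 1
        else:
            streak = 1
        previous = current
        if current and streak >= stable_rounds:
            return current, i  # commit early, before the utterance ends
    return previous, len(chunks)  # never stabilized: fall back to the full query

# Hypothetical corpus and transcribed audio chunks.
corpus = {
    "weather": "weather forecast for san francisco today",
    "flights": "flights from san francisco to new york",
    "hotels": "hotels in new york city",
}
chunks = ["what is the", "weather forecast", "for san francisco",
          "today", "please and thank you"]

results, used = stream_retrieve(chunks, corpus)
print(results, f"committed after {used} of {len(chunks)} chunks")
```

The latency saving in this toy comes from the chunks that never need to be heard before retrieval is committed; a learned trigger, the second approach in the talk, would replace the stability check with a model's decision.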

Robert George presents on Lean and formal verification, arguing we are entering an era of 'verified intelligence.' He traces rapid progress from 2020's GPT-f through recent IMO gold-medal results and open Erdős problem solutions, covering both proof automation and program verification. He introduces TorchLean, the first unified framework for writing neural networks in Lean, enabling verified floating-point arithmetic, certified robustness, and formal proofs of properties like attention permutation invariance. He demonstrates formalizing a known result about non-determinism in LLM inference (where tiny floating-point differences flip argmax outputs) fully in TorchLean.
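The non-determinism result rests on the fact that floating-point addition is not associative, so the order in which a reduction is performed can change a logit and, with it, the argmax. The sketch below shows the effect in plain Python with deliberately exaggerated magnitudes; it is not the TorchLean formalization, and the values and names are chosen purely for illustration.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Three terms that contribute to the logit of token 0.
terms = [1e16, 1.0, -1e16]

# Left to right: 1.0 is absorbed when added to 1e16, so the sum is 0.0.
logit_left_to_right = (terms[0] + terms[1]) + terms[2]

# Reordered: the large terms cancel first, so the sum is 1.0.
logit_reordered = (terms[0] + terms[2]) + terms[1]

other_logit = 0.5  # logit of token 1, identical in both runs

print(argmax([logit_left_to_right, other_logit]))  # token 1 wins
print(argmax([logit_reordered, other_logit]))      # token 0 wins
```

In real inference the discrepancies are far smaller, but the mechanism is the same: whenever two logits are close enough, a different summation order can change which token is selected.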

Luke Orthwine closes with a practitioner talk on agentic software engineering, drawing an analogy to real-time strategy games. He argues that the optimal coding workflow with agents is macro-focused, highly parallel, and minimizes human keystrokes per unit of work. Key practices include: using git worktrees for parallel isolated agent workspaces, running all agents in dangerously-skip-permissions mode inside sandboxes, having agents always push to PR rather than stopping for approval, building a structured linked knowledge base that agents and humans share, and using APM (agent tool calls per minute) as a productivity metric. His team reported a 3.5x increase in PRs per engineer per month after fully adopting these practices.
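The worktree practice can be sketched as a small helper that gives each agent task its own branch and checkout directory. The directory layout, the `agent/` branch prefix, the task names, and the function names are assumptions for illustration and were not specified in the talk; only `git -C` and `git worktree add -b` are standard git commands.

```python
import subprocess
from pathlib import Path

def worktree_command(repo, task, base="main"):
    """Build the `git worktree add` argv for one agent task (nothing is executed)."""
    repo_path = Path(repo)
    # Assumed layout: a sibling directory next to the main checkout.
    path = repo_path.parent / f"{repo_path.name}-{task}"
    return ["git", "-C", str(repo_path), "worktree", "add",
            "-b", f"agent/{task}", str(path), base]

def create_worktrees(repo, tasks, base="main"):
    """Create one isolated worktree per task; raises if any git call fails."""
    for task in tasks:
        subprocess.run(worktree_command(repo, task, base), check=True)

# Hypothetical repository path and task names; only prints the commands.
for task in ["fix-login", "add-metrics"]:
    print(" ".join(worktree_command("/work/app", task)))
```

Because every worktree has its own working directory and branch, several agents can edit, build, and test in parallel without overwriting each other's files, and each branch can be pushed as its own PR.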

Key Insights

  • François argues that a model trained on the human-generated data subspace H cannot feasibly sample the rest of the solution space (F minus H), even with infinite test-time compute or recursive self-improvement. This makes AlphaZero-style self-play, unbiased by human demonstrations, the more probable path to highly intelligent systems.
  • Yas Beg reports that prior ESM2 protein language models showed a plateau in scaling performance, and the fix was not architectural innovation but simply scaling training data from 50 million to 2.8 billion sequences using metagenomic data from soil, oceans, and human guts, restoring smooth log-linear scaling curves.
  • Luke Bailey identifies that vanilla self-play for LLMs fails because rewarding the conjecturer purely on solver difficulty causes it to generate artificially obfuscated, inelegant problems—analogous to giving a student a three-page high school calculus problem to guarantee a 50% error rate—rather than problems useful for genuine capability improvement.
  • Yas Beg shows that sparse autoencoder analysis of ESMC's latent space reveals it spontaneously organizes into a clean biological hierarchy—from individual amino acids to structural motifs to whole protein functional roles—purely from masked language modeling with no supervised biological annotation.
  • Luke Orthwine reports his team achieved a 3.5x increase in PRs per engineer per month through agentic workflows, and an additional 60% gain in the most recent month after broad team adoption, arguing that agents should always be pushed to PR completion even if wrong, because course-correction is cheaper than waiting for human approval at each step.

Topics

  • Protein language model scaling laws (ESM Cambrian / ESMC)
  • Asymmetric self-play for LLMs (SGS algorithm)
  • Streaming RAG for low-latency voice AI agents
  • Formal verification with Lean and TorchLean
  • Agentic software engineering and RTS-style parallel workflows
  • Bitter lesson applicability to biology
  • Intelligence per sample and continual learning
  • Mechanistic interpretability of protein language models
