The next big breakthrough will be AIs learning on the job
The speaker discusses how AI labs are betting on reinforcement learning with verifiable rewards (RLVR) to achieve AGI, but argues this approach has fundamental limitations. He contends that true general intelligence requires continual on-the-job learning through weight updates, which current scaling paradigms don't adequately address.
Summary
The transcript explores the current AI research paradigm, in which labs train agents on millions of verifiable tasks across diverse RL environments, betting that this will produce AGI-level problem-solving agents. Proponents argue that scaling compute can overcome current limitations such as data inefficiency and the lack of continual learning, much as scaling resolved earlier limitations of LLMs.
The speaker identifies a critical bottleneck: domains must be 'grindable'—allowing parallel rollouts from identical starting points in deterministic simulators. Computer use has progressed slowly compared to coding and math precisely because it lacks this property; you cannot run thousands of parallel bot checkouts on Amazon without detection. This reveals a fundamental limitation: many real-world domains (building businesses, winning elections, trading successfully) cannot be recreated in data centers or reset easily, requiring actual real-world interaction over months or years.
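To make 'grindable' concrete, here is a minimal sketch of what the property looks like in code. Everything in it is hypothetical and chosen for illustration: `ToyEnv`, `rollout`, and `parallel_rollouts` are invented names, and the toy dynamics stand in for a real simulator. The point is only that every rollout can be reset to the same starting state and replayed exactly, which is what a live website or a real business does not allow.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class ToyEnv:
    """Hypothetical deterministic simulator: the same seed gives the same episode."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # private RNG, so episodes never interfere
        self.state = 0                  # every rollout starts from this identical state

    def step(self, action):
        # The transition depends only on the state, the action, and the seeded RNG.
        self.state += action + self.rng.choice([0, 1])
        return self.state


def rollout(args):
    """Run one episode; the policy seed is the only thing that varies across rollouts."""
    env_seed, policy_seed, horizon = args
    env = ToyEnv(env_seed)
    policy_rng = random.Random(policy_seed)
    final_state = 0
    for _ in range(horizon):
        action = policy_rng.choice([-1, 1])  # stand-in for sampling from a policy
        final_state = env.step(action)
    return final_state


def parallel_rollouts(env_seed, n, horizon=10):
    """Launch n rollouts from the same starting point and collect them in order."""
    jobs = [(env_seed, policy_seed, horizon) for policy_seed in range(n)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(rollout, jobs))


if __name__ == "__main__":
    print(parallel_rollouts(env_seed=0, n=16))
```

Because the environment can be reset and replayed, running the batch twice yields identical results, and the batch size is limited only by compute. Domains such as checkout flows on a live site, or building a company, offer neither the reset nor the replay.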
The speaker argues that RLVR generalization may not be infinitely strong. Just as Dario Amodei suggested short-horizon RL training doesn't necessarily generalize to long-horizon performance, training on white-collar tasks may not generalize to building a business from scratch like Sam Walton. He emphasizes that while in-context learning is improving, without distilling knowledge back into weights, gains made during deployment remain ephemeral.
The speaker highlights that 30-50% of compute goes to inference—currently not improving the model—despite deployment being where the most valuable learning signals appear. Current online learning only works for limited cases where the same objective (like Cursor's tab prediction) can be learned across millions of users, but real continual learning requires learning different things for different users and organizations.
He proposes on-policy self-distillation (OPSD) as a solution superior to both RLVR and supervised fine-tuning. OPSD doesn't require outer-loop verification and provides denser supervision signals than RL. It also avoids memorizing transcripts of sessions, instead consolidating only relevant insights—a property RL excels at through sparse parameter updates.
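The summary does not spell out how OPSD works, so the following is only a toy sketch under stated assumptions. It assumes the 'teacher' is the same model conditioned on extra context (for example, user feedback), represented here by a fixed table of logits, and that the 'student' is the model without that context. The student samples its own trajectory (the on-policy part) and, at every position it visits, is pulled toward the teacher's full next-token distribution (the dense part), with no outer verifier. All names (`opsd_episode`, `train`, the 3-token vocabulary, the learning rate) are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opsd_episode(student, teacher, rng, horizon=5, lr=0.5):
    """One on-policy episode: the student picks the states, the teacher grades each one."""
    state = 0  # in this toy, the state is just the previous token
    for _ in range(horizon):
        p = softmax(student[state])  # student: no extra context
        q = softmax(teacher[state])  # teacher: assumed to be the same model with context
        d = kl(p, q)
        # Exact gradient of KL(p || q) with respect to the student's logits.
        grad = [pi * (math.log(pi / qi) - d) for pi, qi in zip(p, q)]
        student[state] = [s - lr * g for s, g in zip(student[state], grad)]
        # The next state comes from the student's own sample, not from the teacher.
        state = rng.choices(range(len(p)), weights=p)[0]


def mean_kl(student, teacher):
    pairs = zip(student, teacher)
    return sum(kl(softmax(s), softmax(t)) for s, t in pairs) / len(student)


def train(steps=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical teacher: after token i, it prefers to emit token i again.
    teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
    student = [[0.0] * 3 for _ in range(3)]
    before = mean_kl(student, teacher)
    for _ in range(steps):
        opsd_episode(student, teacher, rng)
    return before, mean_kl(student, teacher)


if __name__ == "__main__":
    before, after = train()
    print(f"mean KL before: {before:.4f}, after: {after:.4f}")
```

Note the contrast with outcome-based RL: here every visited position yields a full distribution to match, whereas a verifier would return a single scalar for the whole episode. States the student never visits are never updated, which is one way to picture updates that touch only what the model's own behavior makes relevant.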
The speaker introduces 'dreaming' as a speculative fourth scaling axis: models building and training against simulated environments of reality to practice skills, potentially gaining orders of magnitude more simulated samples. He references EfficientZero, which played dozens of simulated games internally for each real game step.
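The ratio of simulated to real experience can be illustrated with a classic, much simpler technique than EfficientZero: tabular Dyna-Q, swapped in here purely for illustration. The agent records each real transition in a learned model, then replays `dreams_per_step` remembered transitions for every real step. The chain environment, the function names, and all constants are hypothetical.

```python
import random

N_STATES, GOAL = 5, 4  # toy chain: states 0..4, reward on reaching state 4


def real_step(state, action):
    """The 'real world': action 0 moves left, action 1 moves right."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def q_update(q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """Standard one-step Q-learning update; the goal state is terminal."""
    target = r if s2 == GOAL else r + gamma * max(q[s2])
    q[s][a] += alpha * (target - q[s][a])


def dyna_q(real_steps, dreams_per_step, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # learned model: (state, action) -> (next_state, reward)
    state, dreamed = 0, 0
    for _ in range(real_steps):
        action = rng.randrange(2)  # purely random exploration, for simplicity
        nxt, reward = real_step(state, action)
        q_update(q, state, action, reward, nxt)   # learn from the real step
        model[(state, action)] = (nxt, reward)    # remember what happened
        for _ in range(dreams_per_step):          # 'dream': replay the learned model
            s, a = rng.choice(list(model))
            s2, r = model[(s, a)]
            q_update(q, s, a, r, s2)
            dreamed += 1
        state = 0 if nxt == GOAL else nxt         # restart the episode at the goal
    return q, dreamed


if __name__ == "__main__":
    q, dreamed = dyna_q(real_steps=50, dreams_per_step=20)
    print(f"real steps: 50, simulated updates: {dreamed}")
```

With 50 real steps and 20 dreams per step, the agent performs 1,000 simulated updates, a 20x multiplier on its real experience. This only helps to the extent that the learned model is accurate, which is trivially true in this deterministic toy and is the hard part in reality.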
Finally, he sketches a 2027-2028 scenario where RLVR produces competent agents deployed broadly to real-world work. After weeks of context, user feedback triggers weight distillation through OPSD or dreaming. This enables AI capabilities to expand beyond original training domains through iterative adjacent-domain learning, with improvement primarily coming from broad deployment rather than pre-release training.
About this episode
Read it here: https://www.dwarkesh.com/p/the-next-paradigm
Thanks to Mercury for sponsoring this essay.
Mercury (https://mercury.com/) has automated basically my entire bill pay process for my business. I just give contractors a dedicated email address, and when they send an invoice, Mercury automatically creates a draft payment for me to review. I no longer have to hunt through my inbox for invoices or deal with messy spreadsheets to track my bills. Mercury handles it all. Learn more at mercury.com (http://mercury.com)
Timestamps:
(00:00:00) – The big research bet the labs are making
(00:02:12) – Grindability is just as important as verifiability
(00:06:10) – Will RLVR alone generalize?
(00:08:41) – Getting the learning back to the weights
(00:15:22) – Dreaming
(00:17:23) – What 2027 looks like
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe (https://www.dwarkesh.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4)
Key Insights
- The speaker argues that domains must be 'grindable' (allowing parallel deterministic rollouts from identical starting points) for RL training to work effectively, and this property is absent in most real-world domains like business building or politics, creating a fundamental bottleneck that scaling alone cannot overcome.
- The speaker claims that short-horizon RL training may not generalize to long-horizon performance, suggesting that training on containerized tasks cannot guarantee competence in open-ended real-world scenarios, contradicting the scaling-solves-everything hypothesis.
- The speaker contends that 30-50% of compute spent on inference is currently wasted because it doesn't improve models, despite deployment being where the most valuable learning signals (real organizational knowledge, domain-specific tacit knowledge) actually exist.
- The speaker argues that on-policy self-distillation (OPSD) is superior to supervised fine-tuning for continual learning because RL-based updates only change parameters as much as necessary to achieve outcomes, preventing catastrophic forgetting of pre-existing knowledge, whereas memorizing session transcripts does not enable genuine learning.
- The speaker proposes that future AI improvement will shift from pre-release training to post-deployment learning, where capability gains come primarily from accumulated real-world experience across all users and domains, fundamentally changing how AI systems improve over time.
Transcript
So here's the big research bet that all the labs are making. They think that if we train AIs to accomplish millions of verifiable tasks across thousands of diverse RL environments, then we will have basically built AGI. Because this kind of training will have created a kind of problem-solving agent, the kind of thing that can make progress on open-ended tasks for weeks on end in the face of errors and mistakes and ambiguity. People who are optimistic about this vision will say that all these things that we talk about as the fundamental deficits in the current training paradigm, for example, the data inefficiency of these models or the fact that they lack continual learning, these things can…
More from Dwarkesh Podcast
Grant Sanderson – AI and the future of math
The discussion centers on the rapid advancements of AI in mathematics, exploring its implications for the future of math and related fields. The conversation highlights how AI's capabilities impact traditional mathematical roles, the process of knowledge creation, and the potential for new insights in various domains.
The data black hole at the center of AI
The transcript argues that AI's primary driver of progress is data quantity and quality rather than architectural improvements or scaling, highlighting a massive gap in sample efficiency between humans and AI models. The speaker contends that current AI systems are fundamentally different from human intelligence, requiring orders of magnitude more data to learn skills. Despite this inefficiency, AI can still automate white-collar work due to the economics of scale and parallelism.
Ada Palmer – Machiavelli is the most misunderstood thinker of all time
Ada Palmer discusses Machiavelli's political theories and their historical context, emphasizing the instability of Italian city-states and the influence of the papacy. She explores how Machiavelli's personal experiences and insights shaped his writings, particularly in 'The Prince' and 'Discourses on Livy'.
Alex Imas and Phil Trammell – What remains scarce after AGI?
Economists Alex Imas and Phil Trammell discuss what will remain scarce after AGI, covering labor share stability, the 'relational sector,' wealth redistribution mechanisms, and implications for developing countries. They explore historical parallels to industrial automation, the plausibility of various economic scenarios, and why negative economic growth from AI abundance is theoretically very difficult to achieve.
Eric Jang – Building AlphaGo from scratch
Eric Jang discusses the construction of AlphaGo from scratch, exploring its implications for AI research and development, particularly in game-playing AI and deep reinforcement learning. He emphasizes the significance of combining neural networks with Monte Carlo Tree Search (MCTS) to achieve superior performance in complex environments like Go.