Opinion · Technical

The next big breakthrough will be AIs learning on the job

Dwarkesh Podcast · June 26, 2026 · 19m 38s

The speaker discusses how AI labs are betting on reinforcement learning from verifiable rewards (RLVR) to achieve AGI, but argues that this approach has fundamental limitations. He contends that true general intelligence requires continual on-the-job learning through weight updates, which current scaling paradigms don't adequately address.

Summary

The transcript explores the current AI research paradigm, in which labs train agents on millions of verifiable tasks across diverse RL environments, betting that this will produce AGI-level problem-solving agents. Proponents argue that scaling compute can overcome current limitations such as data inefficiency and the lack of continual learning, much as scale overcame earlier limitations of LLMs.

The speaker identifies a critical bottleneck: domains must be 'grindable'—allowing parallel rollouts from identical starting points in deterministic simulators. Computer use has progressed slowly compared to coding and math precisely because it lacks this property; you cannot run thousands of parallel bot checkouts on Amazon without detection. This reveals a fundamental limitation: many real-world domains (building businesses, winning elections, trading successfully) cannot be recreated in data centers or reset easily, requiring actual real-world interaction over months or years.
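To make 'grindable' concrete, here is a minimal sketch of what the property asks for: a deterministic transition function, a reset to an identical starting state, and a verifiable end-of-episode check. Everything in it (the `State` fields, the toy task, the random policy) is a hypothetical illustration and not something taken from the episode.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical task state: reach `target` before the steps run out."""
    position: int
    target: int
    steps_left: int


def step(state: State, action: int) -> State:
    """Deterministic transition: same state and action always give the same result."""
    return State(state.position + action, state.target, state.steps_left - 1)


def verify(state: State) -> float:
    """Verifiable reward: 1.0 if the final position equals the target, else 0.0."""
    return 1.0 if state.position == state.target else 0.0


def rollout(start: State, seed: int) -> float:
    """One rollout of a random policy; the seed only varies the policy's choices."""
    rng = random.Random(seed)
    state = start
    while state.steps_left > 0:
        state = step(state, rng.choice([-1, 0, 1]))
    return verify(state)


def grind(start: State, num_rollouts: int) -> list[float]:
    """Many rollouts from the identical start; each one is independent of the others."""
    return [rollout(start, seed) for seed in range(num_rollouts)]


start = State(position=0, target=2, steps_left=4)
rewards = grind(start, num_rollouts=1000)
mean_reward = sum(rewards) / len(rewards)
# Group-relative signal: how much better each rollout did than its siblings.
advantages = [r - mean_reward for r in rewards]
```

A live website checkout, a business, or an election offers none of the three ingredients: `step` is not deterministic, there is no `start` to return to, and `verify` may take months to evaluate.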

The speaker argues that RLVR generalization may not be infinitely strong. Just as Dario Amodei suggested short-horizon RL training doesn't necessarily generalize to long-horizon performance, training on white-collar tasks may not generalize to building a business from scratch like Sam Walton. He emphasizes that while in-context learning is improving, without distilling knowledge back into weights, gains made during deployment remain ephemeral.

The speaker highlights that 30-50% of compute goes to inference—currently not improving the model—despite deployment being where the most valuable learning signals appear. Current online learning only works for limited cases where the same objective (like Cursor's tab prediction) can be learned across millions of users, but real continual learning requires learning different things for different users and organizations.

He proposes on-policy self-distillation (OPSD) as a solution superior to both RLVR and supervised fine-tuning. OPSD doesn't require outer-loop verification and provides denser supervision signals than RL. It also avoids memorizing transcripts of sessions, instead consolidating only relevant insights—a property RL excels at through sparse parameter updates.
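The summary does not say how OPSD computes its training signal. The sketch below assumes one common on-policy distillation formulation: the student is the model without the session context, the teacher is the same model with that context, and the loss is a per-token reverse KL on a sequence the student sampled. The logits are made-up numbers and all function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student: list[float], teacher: list[float]) -> float:
    """KL(student || teacher) at a single token position."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)


def self_distillation_signal(
    student_logits: list[list[float]], teacher_logits: list[list[float]]
) -> list[float]:
    """One loss term per token, rather than one scalar reward per episode."""
    return [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


# Toy next-token logits over a 3-token vocabulary at 3 positions (assumed values).
# Student: the model WITHOUT the session context. Teacher: the same model WITH it.
student_logits = [[2.0, 0.5, 0.1], [1.0, 1.0, 1.0], [0.2, 0.1, 3.0]]
teacher_logits = [[2.0, 0.5, 0.1], [0.0, 3.0, 0.0], [0.2, 0.1, 3.0]]

signal = self_distillation_signal(student_logits, teacher_logits)
```

Under these assumptions, only the middle position (where the context changes the model's prediction) produces a nonzero loss, which is one way to read the claim that the method consolidates relevant insights instead of memorizing the whole transcript. No external verifier appears anywhere in the computation.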

The speaker introduces 'dreaming' as a speculative fourth scaling axis: models building and training against simulated environments of reality to practice skills, potentially gaining orders of magnitude more simulated samples. He references EfficientZero, which played dozens of simulated games internally for each real game step.
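The idea of many simulated samples per real sample can be illustrated with a tabular Dyna-Q loop, a classic and much simpler technique than EfficientZero (which is a MuZero-style system with a learned model and tree search). The chain environment, the constants, and the ratio of 20 dreamed updates per real step are all assumptions chosen for illustration.

```python
import random

N_STATES = 5               # chain of states 0..4; reaching state 4 pays reward 1
ACTIONS = (-1, 1)          # move left or right
DREAMS_PER_REAL_STEP = 20  # assumed ratio of simulated to real updates


def real_step(state: int, action: int) -> tuple[int, float]:
    """The 'real' environment: expensive in practice, so called sparingly."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def q_update(q, state, action, reward, nxt, lr=0.5, gamma=0.9):
    """Standard one-step Q-learning update."""
    best_next = max(q[(nxt, a)] for a in ACTIONS)
    q[(state, action)] += lr * (reward + gamma * best_next - q[(state, action)])


def train(real_steps: int, seed: int = 0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned world model: (state, action) -> (next_state, reward)
    real_updates = dream_updates = 0
    state = 0
    for _ in range(real_steps):
        action = rng.choice(ACTIONS)
        nxt, reward = real_step(state, action)
        q_update(q, state, action, reward, nxt)
        model[(state, action)] = (nxt, reward)
        real_updates += 1
        # "Dreaming": replay transitions from the learned model, not the real world.
        for _ in range(DREAMS_PER_REAL_STEP):
            s, a = rng.choice(sorted(model))
            n, r = model[(s, a)]
            q_update(q, s, a, r, n)
            dream_updates += 1
        # Episode ends at the goal state; otherwise continue from where we are.
        state = 0 if nxt == N_STATES - 1 else nxt
    return q, real_updates, dream_updates


q, real_updates, dream_updates = train(real_steps=200)
```

Here the learned model is a lookup table of observed transitions, so the dreams are exact replays. The speculative part of the proposal is doing the same thing with a learned simulator of messy real-world domains, where the model's errors would limit how much dreaming helps.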

Finally, he sketches a 2027-2028 scenario where RLVR produces competent agents deployed broadly to real-world work. After weeks of context, user feedback triggers weight distillation through OPSD or dreaming. This enables AI capabilities to expand beyond original training domains through iterative adjacent-domain learning, with improvement primarily coming from broad deployment rather than pre-release training.

About this episode

Read it <a href="https://www.dwarkesh.com/p/the-next-paradigm" target="_blank">here</a>.

Thanks to Mercury for sponsoring this essay. <a href="https://mercury.com/" target="_blank">Mercury</a> has automated basically my entire bill pay process for my business. I just give contractors a dedicated email address, and when they send an invoice, Mercury automatically creates a draft payment for me to review. I no longer have to hunt through my inbox for invoices or deal with messy spreadsheets to track my bills. Mercury handles it all. Learn more at <a href="http://mercury.com" target="_blank">mercury.com</a>.

Timestamps:
(00:00:00) – The big research bet the labs are making
(00:02:12) – Grindability is just as important as verifiability
(00:06:10) – Will RLVR alone generalize?
(00:08:41) – Getting the learning back to the weights
(00:15:22) – Dreaming
(00:17:23) – What 2027 looks like

Get full access to Dwarkesh Podcast at <a href="https://www.dwarkesh.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4">www.dwarkesh.com/subscribe</a>

Key Insights

The speaker argues that domains must be 'grindable' (allowing parallel deterministic rollouts from identical starting points) for RL training to work effectively, and this property is absent in most real-world domains like business building or politics, creating a fundamental bottleneck that scaling alone cannot overcome.
The speaker claims that short-horizon RL training may not generalize to long-horizon performance, suggesting that training on containerized tasks cannot guarantee competence in open-ended real-world scenarios, contradicting the scaling-solves-everything hypothesis.
The speaker contends that 30-50% of compute spent on inference is currently wasted because it doesn't improve models, despite deployment being where the most valuable learning signals (real organizational knowledge, domain-specific tacit knowledge) actually exist.
The speaker argues that on-policy self-distillation (OPSD) is superior to supervised fine-tuning for continual learning because RL-based updates only change parameters as much as necessary to achieve outcomes, preventing catastrophic forgetting of pre-existing knowledge, whereas memorizing session transcripts does not enable genuine learning.
The speaker proposes that future AI improvement will shift from pre-release training to post-deployment learning, where capability gains come primarily from accumulated real-world experience across all users and domains, fundamentally changing how AI systems improve over time.

Topics

Reinforcement learning from verifiable rewards (RLVR) as a path to AGI
Limitations of RLVR in non-reproducible, real-world domains
Continual learning and weight updates from deployment
In-context learning vs. parameter learning trade-offs
On-policy self-distillation (OPSD) as a continual learning mechanism
Test-time training and 'dreaming' as a speculative scaling axis
Sample efficiency in AI training
Computer use as a case study in domain-specific progress barriers

Transcript

So here's the big research bet that all the labs are making. They think that if we train AIs to accomplish millions of verifiable tasks across thousands of diverse RL environments, then we will have basically built AGI. Because this kind of training will have created a kind of problem-solving agent, the kind of thing that can make progress on open-ended tasks for weeks on end in the face of errors and mistakes and ambiguity. People who are optimistic about this vision will say that all these things that we talk about as the fundamental deficits in the current training paradigm, for example, the data inefficiency of these models or the fact that they lack continual learning, these things can…
