Understanding the Most Viral Chart in Artificial Intelligence
The Odd Lots podcast hosts Joe Weisenthal and Tracy Alloway interview Joel Becker and Chris Painter from METR, a nonprofit AI safety organization, about their viral 'time horizon' charts that measure AI model capabilities. The discussion covers how these benchmarks are constructed, their limitations, the safety motivations behind them, and the broader tensions between AI development, investment incentives, and safety concerns.
Summary
The episode centers on METR's 'time horizon' charts, which have become arguably the most viral visualization in AI. Hosts Joe Weisenthal and Tracy Alloway speak with Joel Becker (technical staff) and Chris Painter (president) of METR, a San Francisco-based nonprofit focused on AI safety evaluation.
METR's core mission is assessing AI autonomy and dangerous capabilities: specifically, whether AI systems could assist in catastrophic activities like bioweapon development or large-scale cyberattacks. The time horizon charts were originally designed to answer a safety question: when would AI systems be capable enough that concerns about misalignment and loss of human control would become meaningful? The intuition was that three to four years ago, AI systems were so limited that fears about them 'going rogue' were almost nonsensical.
The time horizon metric works by having skilled human engineers complete the same software and machine learning tasks given to AI models, timing how long those tasks take, so that a task's difficulty can be expressed in human-hours. The AI's 'time horizon' is the task length at which it achieves a 50% success rate. As of early 2026, Claude Opus 4.6 achieved a time horizon of approximately 12 hours, nearly double the previous high of around 6 hours set by OpenAI's GPT Codex models, which is what sent the chart viral. The doubling time of AI capabilities, as measured by this metric, appears to have accelerated from roughly every 7 months to approximately every 4 months.
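To make the metric concrete, here is a minimal sketch of the calculation, assuming the approach METR has described publicly: fit a logistic curve of success probability against the log of task length, then read off where the curve crosses 50%. The run data below is invented for illustration and is not METR's actual data.

```python
# Minimal sketch: estimate a model's 50% time horizon by fitting a
# logistic curve of success probability against log2(task length).
import numpy as np
from scipy.optimize import minimize

# Hypothetical runs: (task length in human-hours, did the model succeed?)
runs = [
    (0.25, 1), (0.25, 1), (0.5, 1), (0.5, 1), (1, 1), (1, 1),
    (2, 1), (2, 0), (4, 1), (4, 0), (8, 1), (8, 0),
    (16, 1), (16, 0), (32, 0), (32, 0), (64, 0), (64, 0),
]
x = np.log2([t for t, _ in runs])          # log2 of task length
y = np.array([s for _, s in runs], float)  # 1 = success, 0 = failure

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))  # logistic success curve
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

a, b = minimize(neg_log_likelihood, x0=[1.0, -0.5]).x

# The p-threshold horizon solves sigmoid(a + b * log2(t)) = p, i.e.
# log2(t) = (logit(p) - a) / b. At p = 0.5, logit(p) = 0.
def horizon(p):
    return 2 ** ((np.log(p / (1 - p)) - a) / b)

print(f"50% time horizon: {horizon(0.5):.1f} human-hours")
```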
The hosts probe several nuances and potential criticisms of the charts. At the 80% success threshold rather than 50%, progress looks less dramatic, though Joel argues the doubling rate is essentially the same, just offset downward. The 50% threshold is used partly for statistical reasons: it requires fewer samples to estimate reliably and matches conventions in prior literature. Joel also acknowledges that with only about three human baselines per task the methodology has real limitations, and that baselining will only get harder as AI time horizons stretch into months and beyond.
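Both of Joel's points, that the 80% curve is the same trend just offset, and that 50% is easier to estimate, fall out of the logistic model sketched above. If the fitted slope is roughly stable across models, every threshold's horizon is a constant multiple of the 50% horizon; and because the logit function steepens as success rates approach 1, the same measurement noise moves high-threshold horizons further. The slope value below is an assumed, illustrative number, not a fitted one:

```python
import numpy as np

logit = lambda p: np.log(p / (1 - p))
b = -0.6  # assumed logistic slope per doubling of task length (illustrative)

# Offset: t_p = t_50 * 2 ** (logit(p) / b) is a constant multiple of t_50,
# so the 80% horizon doubles at exactly the same rate as the 50% horizon.
print(f"80% horizon = {2 ** (logit(0.8) / b):.2f} x the 50% horizon")

# Noise sensitivity: an error dp in the measured success rate shifts
# log2(horizon) by roughly dp / (p * (1 - p) * |b|); the 1/(p(1-p))
# factor grows toward 1, so high thresholds amplify grading noise.
for p in (0.5, 0.8, 0.95):
    print(f"threshold {p:.2f}: sensitivity factor {1 / (p * (1 - p)):.1f}")
```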
A key limitation discussed is the gap between benchmark performance and real-world productivity. The tasks measured are narrowly focused on software and machine learning engineering — the domain where AI labs are actively optimizing. Real-world tasks tend to be messier, involve larger codebases, require collaboration, and are evaluated more holistically than algorithmic scoring allows. These factors mean benchmarks likely overstate productivity gains somewhat, though Joel believes the underlying progress is real and generalizing.
METR has largely excluded Chinese models from its primary charts because they appear to lag U.S. frontier models by roughly 9–12 months on time horizon metrics, and possibly by more than benchmark scores alone would suggest. Chris hinted that Chinese models may perform better on benchmarks than on truly held-out problems, something 'spiritually close' to benchmark gaming.
The conversation also explores the strange sociological dynamic of the AI industry, where the people most enthusiastic about building AI are also often the most alarmed about its risks. Chris draws on the analogy of the Manhattan Project, noting that many early AI safety researchers got into the field precisely because they saw deep learning trends and worried about what full artificial general intelligence or superintelligence could mean. The competitive dynamics — between labs, and between the U.S. and China — create a situation where no individual actor feels able to slow down unilaterally, even if they wanted to.
On investment implications, METR does not engage much with investors directly, but Chris argues that broad public awareness of AI capabilities is preferable to selective knowledge: he would rather all of humanity understand where AI is heading than have only certain actors informed. He acknowledges, however, a tension between financial commitments (like massive data center investments already baked in) and the ability to slow development if safety risks emerged.
METR itself has about 30 people, is growing, and operates as a nonprofit unable to offer equity. It competes on cash compensation and attracts people motivated by working on uniquely important, public-facing research outside the competitive lab environment. Joel closes by noting the team is in a state of triage, identifying 20–30 critical research questions but only able to address one or two per quarter.
Key Insights
- The time horizon metric does not measure how long an AI works continuously, but rather the difficulty of tasks — expressed in human-hours — at which the AI achieves a 50% success rate, based on timing skilled humans doing the same tasks.
- Claude Opus 4.6 achieved a time horizon of approximately 12 hours as of early 2026, nearly doubling the previous high of ~6 hours, which is what made the chart go viral; the doubling time of AI capabilities appears to have accelerated from ~7 months to ~4 months (see the extrapolation sketch after this list).
- The 50% success threshold is used rather than 80% or higher partly for statistical reasons: estimating reliability at very high success rates requires far more samples and is highly sensitive to grading noise, making 50% the most statistically tractable point.
- Joel Becker argues that benchmark performance likely overestimates real-world productivity gains because real tasks are messier, involve larger codebases, require collaboration, and are evaluated more holistically than algorithmic scoring captures.
- Chris Painter argues that broad public awareness of AI capability trends is preferable to selective knowledge, framing METR's mission as informing all of humanity, including investors and governments, rather than reserving information for any particular group.
- Chinese AI models have been excluded from METR's primary charts because they appear to lag U.S. frontier models by roughly 9–12 months on time horizon metrics, and Chris suggested their benchmark scores may overstate their capabilities relative to truly held-out problems.
- Chris identifies a key tension between large financial commitments — such as debt taken on to build data centers — and the ability to slow AI development if safety concerns emerged, arguing these obligations could force continued scaling even against better judgment.
- METR operates in a state of triage with roughly 30 staff, identifying 20–30 world-important research questions per quarter but only able to address one or two, and Joel argues the primary bottleneck is technical talent rather than model access, as AI labs have generally been cooperative with third-party evaluation.
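As a purely illustrative extrapolation, not a projection made in the episode: taking the roughly 12-hour horizon and the roughly 4-month doubling time at face value, the arithmetic for when the horizon would reach week- and month-length tasks looks like this.

```python
import math

h0 = 12.0       # current 50% horizon, in human-hours (from the episode)
doubling = 4.0  # assumed doubling time, in months

# horizon(t) = h0 * 2 ** (t / doubling), so reaching a target horizon
# takes t = doubling * log2(target / h0) months.
for label, target in [("40-hour work week", 40), ("160-hour work month", 160)]:
    months = doubling * math.log2(target / h0)
    print(f"{label}: ~{months:.0f} months, if the trend holds")
```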