Technical · Opinion

The How I AI Bench

How I AI

The speaker introduces 'The How I AI Bench,' a new set of human- and AI-graded benchmarks designed to evaluate language models on practical tasks: writing PRDs, solving bugs, and one-shotting designs. They put Claude Sonnet 5 to the test and note that it does not quite reach the figures shown for specialized benchmarks (69% on Agentic Coding SWE-Bench Pro, 82% on Terminal Bench 2.1), though the difference may not be noticeable in real-world usage.

Summary

The speaker is tired of relying on subjective 'vibe checks' to evaluate AI models and announces 'The How I AI Bench', a standardized set of benchmarks that combines human and AI grading. The benchmark covers three practical competencies: writing PRDs (product requirement documents), solving bugs, and one-shotting designs. It is meant to be run regularly against new models as they are released. The speaker then applies it to Claude Sonnet 5 and presents comparative performance data. Sonnet 5 does not quite reach the figures shown for specialized benchmarks such as Agentic Coding SWE-Bench Pro (69%) or Terminal Bench 2.1 (82%), but the speaker suggests these marginal differences are unlikely to affect most users' actual experience with the model. The speaker also notes the model's purported strengths in computer work and knowledge work tasks.

Key Insights

  • The speaker is moving away from subjective 'vibe checks' toward developing standardized benchmarks that measure AI model performance on practical tasks users actually care about
  • The How I AI Bench is specifically designed to test three practical competencies: writing PRDs, solving bugs, and executing one-shot design tasks using a combination of human and AI grading
  • Claude Sonnet 5 does not quite reach the 69% shown for Agentic Coding SWE-Bench Pro or the 82% shown for Terminal Bench 2.1, positioning it as competitive but not leading on specialized benchmarks
  • The speaker believes that the performance gaps between Sonnet 5 and higher-scoring models on technical benchmarks are unlikely to be noticed by most users in practical applications
  • Claude Sonnet 5 is reported to have particular strengths in computer work and knowledge work tasks
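
As a rough illustration of what "human and AI grading" over the three tasks could look like, here is a minimal Python sketch. Only the three task names come from the episode; the `Grade` record, the 0 to 1 score scale, the 50/50 weighting, and the function names are all hypothetical and are not the speaker's actual method.

```python
from dataclasses import dataclass
from statistics import mean

# Task names follow the episode; everything else below is an assumption.
TASKS = ("write_prd", "solve_bug", "one_shot_design")


@dataclass
class Grade:
    task: str
    human_score: float  # 0.0-1.0, assigned by a human reviewer (hypothetical scale)
    ai_score: float     # 0.0-1.0, assigned by an AI grader (hypothetical scale)


def combined_score(grade: Grade, human_weight: float = 0.5) -> float:
    """Blend human and AI scores; the 50/50 default weight is an assumption."""
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * grade.human_score + (1.0 - human_weight) * grade.ai_score


def bench_report(grades: list[Grade]) -> dict[str, float]:
    """Average the combined score per task, then add an overall mean."""
    report: dict[str, float] = {}
    for task in TASKS:
        scores = [combined_score(g) for g in grades if g.task == task]
        if scores:  # skip tasks with no graded runs
            report[task] = mean(scores)
    if report:
        report["overall"] = mean(list(report.values()))
    return report
```

Running `bench_report` on the same set of graded runs for each new model would give a per-task comparison that is more repeatable than a vibe check, under whatever weighting one chooses.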

Topics

  • AI model benchmarking methodology
  • The How I AI Bench framework
  • Claude Sonnet 5 performance evaluation
  • Practical AI use cases (PRDs, debugging, design)
  • Human and AI-graded evaluation systems

Transcript

[0:00] I've been testing a lot of models and I'm starting to get bored of doing the vibe check. What I want to start developing is a set of benchmarks we can regularly test these new models against that you'll care about. So today I'm going to be introducing the How I AI Bench, a set of AI and Claire Vo graded benchmarks that are going to tell us if this model and any model is good at writing PRDs, solving bugs, and one-shotting designs. And we are going to [0:31] put Sonnet 5 to the test against that proposition. So, as you can see here, it's not quite at this 69% on Agentic Coding SWE-Bench Pro or the 82%…
