The How I AI Bench
The speaker introduces 'The How I AI Bench,' a new set of human- and AI-graded benchmarks designed to evaluate language models on practical tasks: writing PRDs, solving bugs, and one-shotting designs. They test Claude Sonnet 5 against these benchmarks and note that while it does not quite reach the top scores shown on specialized benchmarks (69% on Agentic Coding SWE-Bench Pro, 82% on Terminal Bench 2.1), the difference may not be noticeable in real-world usage.
Summary
The speaker has grown bored of relying on subjective 'vibe checks' to evaluate AI models and announces 'The How I AI Bench,' a standardized set of benchmarks graded by both a human (the host) and AI. The benchmark covers three practical competencies: writing PRDs (product requirement documents), solving bugs, and one-shotting designs. The speaker intends to run it regularly against new models as they are released. The speaker then applies the benchmark to Claude Sonnet 5 and presents comparative performance data. Sonnet 5 does not quite reach the top scores shown on specialized benchmarks such as Agentic Coding SWE-Bench Pro (69%) or Terminal Bench 2.1 (82%), but the speaker suggests these marginal differences are unlikely to affect most users' actual experience with the model. The speaker also notes the model's purported strengths in computer work and knowledge work tasks.
Key Insights
- The speaker is moving away from subjective 'vibe checks' toward developing standardized benchmarks that measure AI model performance on practical tasks users actually care about
- The How I AI Bench is specifically designed to test three practical competencies: writing PRDs, solving bugs, and executing one-shot design tasks using a combination of human and AI grading
- Claude Sonnet 5 does not quite reach the 69% on Agentic Coding SWE-Bench Pro or the 82% on Terminal Bench 2.1 shown in the comparison data, positioning it as competitive but not leading on specialized benchmarks
- The speaker believes that the performance gaps between Sonnet 5 and higher-scoring models on technical benchmarks are unlikely to be noticed by most users in practical applications
- Claude Sonnet 5 is reported to have particular strengths in computer work and knowledge work tasks
Transcript
[0:00] I've been testing a lot of models and I'm starting to get bored of doing the vibe check. What I want to start developing is a set of benchmarks we can regularly test these new models against that you'll care about. So today I'm going to be introducing the How I AI Bench, a set of AI and Claire Vo graded benchmarks that are going to tell us if this model and any model is good at writing PRDs, solving bugs, and one-shotting designs. And we are going to [0:31] put Sonnet 5 to the test against that proposition. So, as you can see here, it's not quite at this 69% on Agentic Coding SWE-Bench Pro or the 82%…
More from How I AI
How a designer became a top engineer
Katie transitioned from designer to top-performing engineer, ranking in the 94th percentile for code throughput across the entire R&D organization. Her success stemmed from technical curiosity combined with supportive engineer mentors who reviewed her code and helped her improve her craft.
No meetings, no Jira, no text threads... and it shipped anyway.
A team successfully shipped a project in 10 weeks by eliminating traditional project management structures entirely—no meetings, Jira, documentation, or text communication. Instead, they relied solely on a 24/7 Zoom room where team members could work synchronously and asynchronously as needed.
Claude automates the busy-work so you can spend more quality time with your kids
The speaker discusses how Claude's Co-worker feature helps parents automate tedious online administrative tasks, freeing up time for more meaningful interactions with their children. By handling tasks like returns and help emails, AI removes low-value busywork rather than replacing genuine human experiences.
Use Claude as your personal shopping assistant
A parent describes using Claude as a household management and shopping assistant to find high-quality, naturally-made products from reputable brands. They created a project in Claude with specific brand criteria and used it to organize notes and vet brands. A key benefit highlighted was Claude surfacing that a previously reputable brand had declined in quality after a corporate takeover.
She built a Claude shopping assistant to stop buying cheap junk
Nicole Ruiz demonstrates how she built a Claude project to automate high-quality shopping decisions for her family, using curated brand lists and purchasing criteria to filter out cheap, poorly-made products. She also shows how Claude Computer Use helps her draft return emails by pulling order details directly from her Gmail. The system is designed to reduce the mental overhead of conscious consumption so she can spend more time with her children.