My taste and the automated benchmark disagreed almost completely
The speaker discusses discrepancies between their subjective evaluation of model performance and automated benchmark results, noting that the two model judges (Opus and GPT 5.5) differ in how generous they are and were checked for bias. They conclude that personal judgment matters and plan to encode more of their own taste into the evaluation while retiring saturated benchmark tasks.
Summary
The speaker describes a model-as-judge evaluation in which two language models, Opus and GPT 5.5, graded model outputs, including their own. The benchmark also checked for inherent bias, for example whether Opus rated GPT 5.5 more favorably. In the speaker's experience GPT 5.5 has consistently been the toughest judge, so they prefer it as a judge; it also scored its own output lower than the other judge did. The two judges broadly agreed, but both leaned generous, and balancing the two is the reason the speaker ran the 'double bench'.
The speaker's takeaways are that the right model depends on the job and on how well a model's strengths fit the task, and that their personal taste diverged from the metrics enough that subjective evaluation ('vibe checks') should not be dismissed. They plan to encode more of that taste directly into the judging. They also note that some benchmark tasks have become saturated, specifically agentic bug tracking: all models perform comparably well on them, so they no longer discriminate between models.
Key Insights
- GPT 5.5 was consistently the toughest judge, and it scored its own output lower than the other judge did
- The two judges broadly agree, but both tend toward generosity in their assessments
- The speaker's personal taste and subjective evaluation diverged significantly from what automated metrics indicated
- Model performance depends on the specific task and how well a model's strengths fit that particular task type
- Saturated tasks like agentic bug tracking don't function as effective benchmarks because all evaluated models perform comparably well on them
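The checks described above (how generous each judge is, whether a judge favors or penalizes its own model, and which tasks are saturated) can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's actual harness: the judge and model names, the 0-10 scale, the `threshold` value, and every score in `SCORES` are invented for the example.

```python
from statistics import mean

# Hypothetical data: SCORES[judge][model] = per-task scores on a 0-10 scale.
# Every number here is made up for illustration; none come from the episode.
SCORES = {
    "judge_a": {"model_a": [8, 9, 7], "model_b": [8, 8, 7]},
    "judge_b": {"model_a": [7, 8, 6], "model_b": [6, 7, 6]},
}
# Hypothetical mapping from each judge to the candidate it "is".
SELF = {"judge_a": "model_a", "judge_b": "model_b"}


def judge_generosity(scores):
    """Mean score each judge hands out across all models and tasks."""
    return {
        judge: mean(s for runs in by_model.values() for s in runs)
        for judge, by_model in scores.items()
    }


def self_score_gap(scores, self_of):
    """Own-model mean minus the mean the other judges give that model.

    A negative gap means the judge is harsher on itself than its peers are.
    """
    gaps = {}
    for judge, model in self_of.items():
        own = mean(scores[judge][model])
        others = [mean(scores[j][model]) for j in scores if j != judge]
        gaps[judge] = own - mean(others)
    return gaps


def saturated_tasks(scores, threshold=0.5):
    """Indices of tasks where models score within `threshold` of each other."""
    judges = list(scores)
    models = list(scores[judges[0]])
    n_tasks = len(scores[judges[0]][models[0]])
    flagged = []
    for t in range(n_tasks):
        # Average each model's score on task t across judges.
        per_model = [mean(scores[j][m][t] for j in judges) for m in models]
        if max(per_model) - min(per_model) < threshold:
            flagged.append(t)
    return flagged


if __name__ == "__main__":
    print(judge_generosity(SCORES))
    print(self_score_gap(SCORES, SELF))
    print(saturated_tasks(SCORES))
```

In this toy data the second judge hands out lower scores overall and rates its own model below what the other judge gives it, which mirrors the pattern the speaker reports. A task flagged by `saturated_tasks` is a candidate for retirement, since it no longer separates the models.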
Transcript
[0:00] We had a model as a judge, and so we had Opus 4A and 5.5 judge itself. I had the benchmark check if there was any inherent bias, like did Opus like 5.5 better? I've consistently seen GPT 5.5 be the toughest judge, and so I actually prefer a 5.5 judge, but it judged itself lower than the other judge did. The judges overall agree, but they were overall generous, and sort of balancing these two judges is exactly why we ran [0:30] this double bench. Takeaways, the model's going to depend on the job and the strength of the model fit by task. I would say my taste actually matters, so maybe those vibe checks are not bad,…
More from How I AI
Task-by-task model recommendations
The speaker provides task-specific model recommendations across different use cases, suggesting GPT 5.5 for PRDs, Sonnet 4.6 for prototyping and casual interaction, and Opus 4.8 or Sonnet 5 for codebase work. Model selection varies based on complexity, with Opus 4.8 excelling at dense UI design and Sonnet suitable for simpler implementations.
The How I AI Bench
The speaker introduces 'The How I AI Bench,' a new set of human- and AI-graded benchmarks designed to evaluate language models on practical tasks like writing PRDs, solving bugs, and designing systems. They test Claude Sonnet 3.5 against these benchmarks and note that while it scores lower on some specialized benchmarks (69% on agentic coding SWE-Bench Pro, 82% on Terminal Bench 2.1), the difference may not be noticeable in real-world usage.
How a designer became a top engineer
Katie transitioned from designer to top-performing engineer, ranking in the 94th percentile for code throughput across the entire R&D organization. Her success stemmed from technical curiosity combined with supportive engineer mentors who reviewed her code and helped her improve her craft.
No meetings, no Jira, no text threads... and it shipped anyway.
A team successfully shipped a project in 10 weeks by eliminating traditional project management structures entirely—no meetings, Jira, documentation, or text communication. Instead, they relied solely on a 24/7 Zoom room where team members could work synchronously and asynchronously as needed.
Claude automates the busy-work so you can spend more quality time with your kids
The speaker discusses how Claude's Co-worker feature helps parents automate tedious online administrative tasks, freeing up time for more meaningful interactions with their children. By handling tasks like returns and help emails, AI removes low-value busywork rather than replacing genuine human experiences.