Opus 4.8 Scored 81. Your Workflow Doesn't Care.
The creator argues that Claude Opus 4.8 is a strong but inconsistent 'checkpoint release' timed to Anthropic's funding announcement rather than a true frontier breakthrough. The core argument is that in 2026, the 'harness' (scaffolding around the model) matters more than raw model intelligence, and OpenAI's Codex with GPT-5.5 currently outperforms Claude's ecosystem for long-running agentic tasks. Anthropic's real awaited model is 'Mythos,' and enterprise leaders should architect for model flexibility rather than betting on a single provider.
Summary
The video opens by arguing that the standard 2025 framing of AI — where a new model drop automatically means a new capability ceiling everyone should adopt — no longer applies in 2026. Claude Opus 4.8, released May 28th, is presented as a strategically timed release tied to Anthropic's funding announcement and valuation milestone (near $1 trillion), rather than a genuine leap forward. The real anticipated model from Anthropic is 'Mythos,' which the creator suspects is delayed due to compute constraints.
The creator identifies two core reasons why 4.8 is not becoming his daily driver. First, scaling up reasoning effort does not predictably improve results. Unlike OpenAI models where 'extra high' reasoning reliably beats lower settings, 4.8's 'max' mode sometimes performs worse than 'high' mode. This is supported by the Vending Bench benchmark, where 4.8 on max underperformed 4.8 on high, and both were beaten by 4.7 — a clear regression. The creator attributes this to the model 'overthinking' constitutional alignment questions, with reasoning traces showing the model deliberating on warm paragraph writing and Amanda Askell's preferences rather than focusing on the task.
Second, and more broadly, the creator argues that 'harnesses' — the product scaffolding around a model — are now the primary differentiator, not the model itself. He compares Claude Code (4.8's harness) with Codex (GPT-5.5's harness) on practical tasks, including building and deploying full websites end-to-end. Codex completed two full website builds while 4.8 errored out twice. He also notes that Codex has full file system access and computer use that works reliably in practice, while Claude's desktop app limits file access to Downloads and Desktop without proactively asking for broader permissions.
The creator highlights one genuine innovation in 4.8: the '/workflows' command in Claude Code, which lets the model dynamically compose multi-agent workflows, disclose them to the user, and execute them transparently. He predicts this pattern will be widely copied but notes it currently accelerates work without solving the downstream 'piling problem' — where agents generate more work than humans can review unless the entire pipeline is designed to be agent-native.
The video closes with advice for different audiences: knowledge workers should evaluate whether Claude's writing and front-end design strengths justify its harness limitations; engineers (70% of whom use Claude Code per surveys) should ensure their tooling supports team-level outcomes not just individual productivity; and CTOs/CIOs should architect for model flexibility, anticipating strong open-source 10-trillion-parameter models by year-end and the likelihood that Claude will lead the race again once Mythos ships.
Key Insights
- The creator argues that Opus 4.8 was released specifically to accompany Anthropic's funding announcement and valuation milestone, not because it represents a true frontier capability breakthrough — calling it a 'placeholder release' to show continued progress while everyone waits for Mythos.
- Vending Bench results showed a regression: 4.8 on max mode underperformed 4.8 on high mode, and both were beaten by 4.7 — directly contradicting the established principle that scaling up reasoning effort reliably improves results.
- The creator claims 4.8's reasoning traces in max mode show the model deliberating extensively on constitutional alignment questions — including referencing Amanda Askell's preferences — rather than focusing on the task, causing it to be less effective despite thinking more.
- In a direct head-to-head test, Codex with GPT-5.5 completed two full end-to-end website builds (including DNS deployment) in the time it took Claude 4.8 to error out twice, with the creator attributing the gap to harness quality rather than model intelligence.
- The creator identifies the '/workflows' command in Claude Code as a genuinely novel agentic pattern — where the model dynamically composes a multi-agent workflow, discloses it to the user, and executes it transparently — and predicts it will be widely copied across the industry by summer 2026.
Topics
Full transcript available for MurmurCast members
Sign Up to Access