Why I love GPT-5.5 for hard problems
Clara Vel, a product leader and AI enthusiast, shares her hands-on experience testing GPT-5.5 and GPT-5.5 Pro in Codex over several weeks. She highlights the model's superior intelligence and efficiency for complex technical problems, including autonomous security remediation, a multi-million-row data migration, and hacking into a proprietary Bluetooth device. Her core argument is that GPT-5.5's value is best realized by developers with genuinely hard problems, not average ChatGPT users.
Summary
Clara Vel opens the episode by announcing that OpenAI is releasing GPT-5.5 and GPT-5.5 Pro into Codex and ChatGPT (not yet available via API), and shares that she has been testing the model for several weeks. She confirms OpenAI's claims that the model has a higher capacity for complex work and is more token-efficient, though she notes it comes at a steep price: $5/million input tokens and $30/million output tokens for GPT-5.5, and $30/$180 for GPT-5.5 Pro. Despite the cost, she argues the ROI is justified not just by speed gains but by a new level of ambition: solving problems that were previously unsolvable.
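For a concrete sense of those rates, here is a minimal back-of-the-envelope cost calculation in Python. The per-million-token prices are the ones quoted in the episode; the token counts in the example run are hypothetical:

```python
# Per-million-token rates quoted in the episode (USD).
RATES = {
    "gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single run at the quoted per-million rates."""
    r = RATES[model]
    return (input_tokens / 1e6) * r["input"] + (output_tokens / 1e6) * r["output"]

# Hypothetical long agentic run: 2M input tokens, 400k output tokens.
print(f"GPT-5.5:     ${run_cost('gpt-5.5', 2_000_000, 400_000):.2f}")      # $22.00
print(f"GPT-5.5 Pro: ${run_cost('gpt-5.5-pro', 2_000_000, 400_000):.2f}")  # $132.00
```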
Clara distinguishes between her ChatGPT and Codex experiences. In ChatGPT, she asked GPT-5.5 to build a subtraction learning app for her first-grade son. The model thought for over 17 minutes before producing a functional but visually unimpressive app. She uses this example to argue that most everyday ChatGPT users simply don't have problems complex enough to justify this level of intelligence, and that there is currently an 'intelligence overhang' where the capability exceeds available use cases for general consumers.
The bulk of the episode focuses on her Codex experience with GPT-5.5 Pro. Her first use case involved downloading a CSV of low-severity security issues identified by OpenAI's Codex security product, uploading it, and asking the model to architecturally review, group, and remediate them. The model executed this well, and the quality was validated when a subsequent annual penetration test came back clean. She frames this as a compelling use case for engineers with triage lists of technical debt or security issues.
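The episode doesn't show the CSV's schema, but the shape of the triage step is easy to picture. Here is a minimal sketch of the grouping stage, with hypothetical column names (severity, category, file, title) standing in for whatever the Codex security export actually contains:

```python
import csv
from collections import defaultdict

def group_findings(path: str) -> dict[str, list[dict]]:
    """Group low-severity security findings by category so related issues
    can be reviewed and remediated together, as Clara asked the model to do.
    Column names are hypothetical; the episode doesn't show the export schema."""
    groups: defaultdict[str, list[dict]] = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("severity", "").lower() == "low":
                groups[row.get("category", "uncategorized")].append(row)
    return dict(groups)

if __name__ == "__main__":
    for category, findings in group_findings("codex_security_export.csv").items():
        print(f"{category}: {len(findings)} finding(s)")
        for item in findings:
            print(f"  - {item.get('file', '?')}: {item.get('title', '')}")
```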
Her second major use case involved a long-standing data migration problem in her Chat PRD codebase, where millions of chat records were stored in legacy formats that varied across different AI providers' API response schemas. After repeated failed patching attempts, she handed the problem to GPT-5.5 in Codex with minimal prompting, and the model ran autonomously for nearly six hours, spawning sub-agents, testing data, identifying issues, and repairing them, all without a single follow-up prompt from Clara. The resulting solution covered 98% of identified edge cases and cut the app's Sentry error rate to near zero across 2 million rows, with only one uncaught edge case remaining.
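The episode doesn't detail the legacy schemas, but the kind of normalization involved is familiar. A minimal sketch, with both input shapes invented for illustration (the real Chat PRD record formats are not shown):

```python
from typing import Any

def normalize_record(record: dict[str, Any]) -> dict[str, str]:
    """Normalize a legacy chat record into a canonical {role, content} shape.
    The provider formats below are invented for illustration; the episode
    doesn't show Chat PRD's actual schemas."""
    # OpenAI-style chat completion: choices[0].message.{role, content}
    if "choices" in record:
        msg = record["choices"][0]["message"]
        return {"role": msg["role"], "content": msg["content"]}
    # Anthropic-style message: content is a list of typed blocks
    if isinstance(record.get("content"), list):
        text = "".join(
            block["text"] for block in record["content"] if block.get("type") == "text"
        )
        return {"role": record.get("role", "assistant"), "content": text}
    # Already canonical
    if {"role", "content"} <= record.keys():
        return {"role": record["role"], "content": record["content"]}
    raise ValueError(f"unrecognized legacy record shape: {sorted(record)}")
```

The hard part, and presumably what the model spent six hours on, is enumerating every shape that actually occurs across 2 million rows: a function like this grows a new branch for each edge case discovered in the data.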
The episode's climax is Clara's personal 'intelligence benchmark': hacking into a Divoom Mini 2 retro Bluetooth speaker with a proprietary display. She had been attempting this since around Valentine's Day, spending months sniffing Bluetooth packets, digging through Chinese-language hardware documentation, and trying Claude Code and GPT-5.4, all without success. After uploading the packet-sniffing logs and prompting GPT-5.5 with something like 'I believe in you,' the model decoded the proprietary Bluetooth encoding and bitmap compression scheme, letting her control the display from the terminal. She also wired Codex notifications to the device, so it now alerts her on the screen when tasks complete.
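The episode doesn't publish the decoded protocol, but the first step of this kind of reverse engineering is usually mechanical: capture frames that differ by one known change and see which bytes move. A minimal sketch, with hypothetical hex-dump frames (the Divoom Mini 2's actual framing is an assumption here, not from the episode):

```python
def diff_frames(a_hex: str, b_hex: str) -> list[tuple[int, int, int]]:
    """Return (offset, byte_a, byte_b) for every position where two
    captured frames differ. Stable bytes are likely framing/opcodes;
    moving bytes are likely payload or checksum."""
    a, b = bytes.fromhex(a_hex), bytes.fromhex(b_hex)
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Hypothetical sniffed frames for two known display states; the episode
# doesn't reveal the real packet contents GPT-5.5 decoded.
frame_pixel_on  = "01 0a 44 00 00 07 ff 00 00 c3"
frame_pixel_off = "01 0a 44 00 00 07 00 00 00 3c"

for offset, x, y in diff_frames(frame_pixel_on, frame_pixel_off):
    print(f"byte {offset:2d}: {x:02x} -> {y:02x}")
```

Repeating this across many captures narrows the payload region down until the bitmap encoding and checksum fall out, which is roughly the haystack the model was handed in log form.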
Clara closes by calling GPT-5.5 her favorite 'staff software engineer,' praising its intelligence and autonomous efficiency. She notes one quirk: by default in Codex the model has what she calls a 'baked potato personality,' dull and dry, though this can be changed with the '/personality' command. She strongly recommends the model to developers with hard technical problems and invites listeners to share high-intelligence ChatGPT use cases she can test.
Key Insights
- Clara argues that GPT-5.5's primary value is not speed but 'ambition': it let her solve problems that were previously impossible with any other model, rather than just doing existing tasks faster.
- Clara contends that most everyday ChatGPT users lack problems complex enough to justify GPT-5.5's intelligence, describing the current situation as an 'intelligence overhang' where capability outpaces available consumer use cases.
- GPT-5.5 in Codex ran fully autonomously for nearly 6 hours on a complex data migration task across 2 million rows with zero follow-up prompts from Clara, resulting in only one uncaught edge case and a near-zero Sentry error rate.
- Clara directly challenges the narrative that AI coding decreases software quality, arguing instead that quality will go up because models like GPT-5.5 can now autonomously handle complex, edge-case-heavy problems that engineers previously avoided due to insufficient tooling.
- After months of failed attempts using Claude Code and GPT-5.4, GPT-5.5 successfully reverse-engineered the proprietary Bluetooth encoding and bitmap compression of a Divoom Mini 2 device using packet-sniffing logs, which Clara uses as her personal benchmark for model intelligence.