The ONLY AI Benchmark You Need!
A developer created "Buccy Bench," a humorous yet functional AI benchmark that tasks different language models with drawing Gary Busey as SVG code rather than images. The benchmark tracks model evolution over time while measuring performance metrics like cost, tokens, and execution speed.
Summary
The creator built an unconventional AI benchmark called Buccy Bench with a single, absurd task: have AI models draw Gary Busey using SVG code. Rather than using traditional image generation models like DALL-E or Stable Diffusion, this benchmark requires models to write actual code—shapes and lines—that somehow compose into a recognizable rendering of the actor. The test is simple in concept but reveals interesting results: different AI models produce vastly different interpretations, ranging from reasonable attempts to increasingly weird outputs. By tracking models starting from GPT-3.5 Turbo in March 2023, the creator documented how different models evolved in their ability to generate SVGs, with some improving over time while others produced chaotic or bizarre results. The benchmark includes practical features like sorting by cost, tokens used, and execution time, as well as a timeline view that allows users to filter by provider and track each model's "Gary Busey journey." The entire website was built using Fable. While intentionally ridiculous in premise, the creator notes that the benchmark is actually useful—combining entertainment value with legitimate performance measurement capabilities.
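To make the "SVGs are actually code" point concrete, here is a minimal sketch of the kind of text a model has to produce: a handful of shape elements that a browser renders as a picture. The face layout, colors, and the function name `build_face_svg` are illustrative assumptions, not output from the benchmark or the creator's prompt.

```python
import xml.etree.ElementTree as ET


def build_face_svg(size: int = 200) -> str:
    """Return a tiny SVG document: a face made of three circles and one curve."""
    half = size // 2
    shapes = [
        # Head: one large circle centered in the canvas.
        f'<circle cx="{half}" cy="{half}" r="{half - 10}" fill="#f2c9a0"/>',
        # Eyes: two small dark circles.
        f'<circle cx="{half - 30}" cy="{half - 20}" r="8" fill="#222"/>',
        f'<circle cx="{half + 30}" cy="{half - 20}" r="8" fill="#222"/>',
        # Grin: a quadratic Bezier curve drawn with a path element.
        f'<path d="M {half - 40} {half + 30} Q {half} {half + 70} {half + 40} {half + 30}" '
        'stroke="#222" stroke-width="4" fill="none"/>',
    ]
    body = "\n  ".join(shapes)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}" '
        f'viewBox="0 0 {size} {size}">\n  {body}\n</svg>'
    )


svg_text = build_face_svg()

# Parsing confirms the text is well-formed XML, the minimum bar for a renderable SVG.
root = ET.fromstring(svg_text)
```

Saving `svg_text` to a `.svg` file and opening it in a browser shows the drawing. A recognizable portrait needs far more shapes than this, which is part of why results vary so much between models.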
Key Insights
- The creator uses SVG code generation as the benchmark task because it requires AI models to write actual code that produces visual output, which makes the task fundamentally different from what traditional image generation models do
- GPT-3.5 Turbo's March 2023 attempt at drawing Gary Busey via SVG represents an early baseline point in tracking how different AI models evolved at this specific task
- Different AI models show divergent trajectories when generating SVGs—some improve over time while others produce increasingly chaotic or weird results
- The benchmark includes practical performance comparison features like cost analysis, token usage tracking, and execution time measurement alongside the humorous visual results
- The creator built the entire benchmark website using Fable and views the project as intentionally ridiculous in concept but genuinely useful in execution
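The sorting and filtering features described above can be sketched as follows. This is only an assumption about how such a results table might be modeled: the `Run` record, its field names, and the helper functions are hypothetical, and the sample values are placeholders rather than real benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model's attempt, with the metrics the benchmark is said to track."""
    model: str
    provider: str
    cost_usd: float
    tokens: int
    seconds: float


# Placeholder values for illustration only; not real benchmark results.
RUNS = [
    Run("model-a", "provider-x", 0.012, 1800, 9.5),
    Run("model-b", "provider-y", 0.003, 950, 4.2),
    Run("model-c", "provider-x", 0.040, 5200, 31.0),
]

SORTABLE_FIELDS = {"cost_usd", "tokens", "seconds"}


def sort_runs(runs, key, descending=False):
    """Return a new list of runs ordered by one of the numeric metrics."""
    if key not in SORTABLE_FIELDS:
        raise ValueError(f"cannot sort by {key!r}")
    return sorted(runs, key=lambda run: getattr(run, key), reverse=descending)


def filter_by_provider(runs, provider):
    """Keep only the runs from a single provider, as in the timeline view."""
    return [run for run in runs if run.provider == provider]


cheapest_first = sort_runs(RUNS, "cost_usd")
slowest_first = sort_runs(RUNS, "seconds", descending=True)
provider_x_runs = filter_by_provider(RUNS, "provider-x")
```

Restricting the sort key to a fixed set of fields keeps a typo from silently sorting on the wrong attribute.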
Transcript
[0:00] I built a benchmark where AI models have one job: draw Gary Busey using code. It's called Buccy Bench and it's exactly as ridiculous as it sounds. So the test is simple. I asked different AI models to draw Gary Busey as an SVG. Now, it's not a normal AI image. It's not using Nano Banana or DALL-E or Stable Diffusion or anything like that. SVGs are actually code. The model has to write shapes and lines that somehow become Gary Busey, and that makes the results kind of awesome. Back in March 2023, GPT-3.5 [0:32] Turbo had its own very special interpretation of Gary Busey. Then you scroll forward and you could watch the models evolve at…
More from Matt Wolfe
GLM-5.2 - The Open Model That's As Good As Opus!
A comprehensive review of GLM-5.2, an open-weight Chinese AI model with a 1 million token context window, demonstrating its capabilities for coding, document analysis, and agentic workflows at significantly lower costs than frontier models like Claude Opus and GPT-4.5. The speaker tests various use cases including website building, Chrome extensions, game development, and data organization, concluding it's valuable for long, code-heavy, token-expensive tasks despite not universally outperforming closed-source alternatives.
Don't Fall For This AI Trap
The speaker emphasizes that power users distinguish themselves by knowing what NOT to automate with AI, rather than automating everything. They argue that AI works best for clear, straightforward tasks but struggles with nuanced, artistic work requiring consistency—using their failed YouTube thumbnail automation as an example.
AI News: Fable Banned, New Open-Source Leader, Midjourney Shocker
This AI news roundup covers the US government forcing Anthropic to shut down its Fable 5 and Mythos 5 models over a jailbreak security vulnerability, the release of GLM 5.2, a competitive open-source model from ZAI, and Midjourney's surprising pivot into medical imaging with a new ultrasound-based body scanner.
AI News: Claude's Massive Leap & Siri Gets Good!?
This AI news roundup covers the release of Claude Fable 5 (a Mythos-tier model from Anthropic) and its controversial safety restrictions, Apple's WWDC AI announcements including a major Siri overhaul, and updates from Google including NotebookLM upgrades and real-time translation. Additional rapid-fire items include OpenAI and SpaceX IPO filings, ChatGPT email sending, and a teased Midjourney hardware device.
Shopping Online Is About To Change Forever
The video introduces 'agentic commerce,' a new AI-driven shopping paradigm where AI agents proactively match users to products before they search. The platform Glance is highlighted as a leading example, using selfies and personal data to generate personalized outfit recommendations with direct purchase links. The creator frames this as a major evolution in e-commerce beyond chat-based AI search.