Funny, Technical

The ONLY AI Benchmark You Need!

Matt Wolfe

A developer created "Buccy Bench," a humorous yet functional AI benchmark that tasks different language models with drawing Gary Busey as SVG code rather than images. The benchmark tracks model evolution over time while measuring performance metrics like cost, tokens, and execution speed.

Summary

The creator built an unconventional AI benchmark called Buccy Bench around a single, absurd task: have AI models draw Gary Busey using SVG code. Instead of calling an image generation model such as DALL-E or Stable Diffusion, each model must write the code itself, shapes and lines that together should form a recognizable rendering of the actor. The test is simple in concept, but the results vary widely from model to model, from reasonable attempts to outright weird output. The timeline starts with GPT-3.5 Turbo in March 2023 and shows how the models' SVG output changed from there: some improved over time, while others produced chaotic or bizarre results. The site also has practical features: results can be sorted by cost, tokens used, and execution time, and a timeline view lets users filter by provider and follow each model's "Gary Busey journey." The entire website was built using Fable. Although the premise is intentionally ridiculous, the creator notes that the benchmark is actually useful, pairing entertainment value with real performance measurement.
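To make "SVGs are actually code" concrete, here is a minimal sketch of the kind of markup a model has to write. It is a generic cartoon face, not any model's actual output and not the benchmark's prompt or harness; the function names `simple_face_svg` and `count_shapes` are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def simple_face_svg() -> str:
    """Return SVG markup for a generic cartoon face (illustrative only)."""
    return "\n".join([
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">',
        # Head: one filled circle
        '  <circle cx="100" cy="100" r="80" fill="#f2c9a0"/>',
        # Eyes: two small circles
        '  <circle cx="70" cy="85" r="8" fill="black"/>',
        '  <circle cx="130" cy="85" r="8" fill="black"/>',
        # Mouth: a quadratic Bezier curve drawn as a path
        '  <path d="M 60 130 Q 100 165 140 130" stroke="black" '
        'stroke-width="4" fill="none"/>',
        '</svg>',
    ])


def count_shapes(svg_text: str) -> int:
    """Parse the markup as XML and count its top-level shape elements."""
    root = ET.fromstring(svg_text)
    return len(list(root))


if __name__ == "__main__":
    svg = simple_face_svg()
    print(svg)                 # paste into a .svg file to view it in a browser
    print(count_shapes(svg))   # number of shapes in the drawing
```

Every visible feature has to be expressed as coordinates, radii, and path commands, which is why the output differs so much from what an image model produces.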

Key Insights

  • The creator uses SVG code generation as the benchmark task because the model has to write actual code that produces visual output, which makes the task fundamentally different from prompting an image generation model
  • GPT-3.5 Turbo's March 2023 attempt at drawing Gary Busey via SVG represents an early baseline point in tracking how different AI models evolved at this specific task
  • Different AI models show divergent trajectories when generating SVGs—some improve over time while others produce increasingly chaotic or weird results
  • The benchmark includes practical performance comparison features like cost analysis, token usage tracking, and execution time measurement alongside the humorous visual results
  • The creator built the entire benchmark website using Fable and views the project as intentionally ridiculous in concept but genuinely useful in execution
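The sorting and provider filtering described above could be sketched roughly as follows. This is an assumption about how such a comparison might work, not the site's actual implementation; the `Run` record, its field names, and every value in `RUNS` are hypothetical placeholders rather than real benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run (hypothetical schema)."""
    model: str
    provider: str
    cost_usd: float
    tokens: int
    seconds: float


# Placeholder values only; these are NOT real benchmark results.
RUNS = [
    Run("model-a", "provider-x", 0.012, 900, 4.1),
    Run("model-b", "provider-y", 0.003, 1500, 9.8),
    Run("model-c", "provider-x", 0.040, 600, 2.5),
]

SORTABLE_FIELDS = {"cost_usd", "tokens", "seconds"}


def sort_runs(runs: list[Run], field: str) -> list[Run]:
    """Return runs ordered ascending by one of the numeric metrics."""
    if field not in SORTABLE_FIELDS:
        raise ValueError(f"cannot sort by {field!r}")
    return sorted(runs, key=lambda run: getattr(run, field))


def filter_by_provider(runs: list[Run], provider: str) -> list[Run]:
    """Keep only the runs from a single provider, preserving order."""
    return [run for run in runs if run.provider == provider]


if __name__ == "__main__":
    for run in sort_runs(RUNS, "cost_usd"):
        print(run.model, run.cost_usd)
    print([run.model for run in filter_by_provider(RUNS, "provider-x")])
```

Keeping each metric as a plain numeric field means a new sort order only needs one more entry in `SORTABLE_FIELDS`.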

Topics

AI benchmark design, SVG code generation, Model comparison and evolution, Performance metrics tracking, Humorous testing methodology

Transcript

[0:00] I built a benchmark where AI models have one job. Draw Gary Busey using code. It's called Buccy Bench and it's exactly as ridiculous as it sounds. So the test is simple. I asked different AI models to draw Gary Busey as an SVG. Now it's not a normal AI image. It's not using Nano Banana or DALL-E or Stable Diffusion or anything like that. SVGs are actually code. The model has to write shapes and lines that somehow become Gary Busey, and that makes the results kind of awesome. [0:32] Back in March 2023, GPT-3.5 Turbo had its own very special interpretation of Gary Busey. Then you scroll forward and you could watch the models evolve at…
