
Qwen 3.6 Max DESTROYS Opus 4.5, Gemini 3, Deepseek v4!

Julian Goldie SEO

The video analyzes Alibaba's Qwen 3.6 Max preview model, released April 20, 2026, which claims top spots on six coding benchmarks. The presenter finds the claims partially valid but exposes misleading benchmark comparisons, particularly against outdated Claude versions. The conclusion is that no single model dominates all tasks in 2026.

Summary

The video provides a critical breakdown of Alibaba's Qwen 3.6 Max preview model, released April 20, 2026, which claims to top six major coding benchmarks simultaneously. The presenter introduces the model's technical specs: a mixture-of-experts architecture with ~35 billion total parameters (only 3 billion active per request), a 256,000 token context window, and text-only input — meaning no image or vision support. It is accessible via Qwen Studio and Alibaba Cloud API.
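Since the model is reached through Alibaba Cloud's API, a request would presumably follow the familiar OpenAI-compatible chat-completion shape. The sketch below only assembles the request body; the model id and endpoint URL are assumptions for illustration, not details confirmed in the video — check Alibaba Cloud's documentation for the real values.

```python
# Sketch of a chat-completion request body for a hypothetical
# OpenAI-compatible Qwen endpoint. MODEL_ID and BASE_URL are assumed,
# not taken from the video.
import json

BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"  # assumed endpoint
MODEL_ID = "qwen3.6-max-preview"  # hypothetical id for the preview model

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Assemble the JSON body for a single-turn request.

    The model is text-only per the video, so messages carry plain
    strings with no image parts.
    """
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_request("Write a function that parses ISO 8601 dates.")
print(json.dumps(payload, indent=2))  # POST this to BASE_URL with any HTTP client
```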

On benchmarks, the model shows notable gains over its predecessor Qwen 3.6 Plus, including a 10.8-point improvement on Code and a 9.9-point jump on Skills Bench. The presenter highlights the SciCode (scientific coding) improvement as particularly meaningful, since it reflects genuine technical reasoning ability rather than surface-level performance.

However, the presenter raises a significant credibility issue: Alibaba's benchmark comparisons used Claude Opus 4.5 as the baseline rather than the newer Opus 4.6 or 4.7, making Qwen's wins appear larger than they are. On Terminal Bench 2.0, for example, Alibaba listed a Qwen win of 65.4% vs. Claude's 59.3% — but that Claude score is Opus 4.5, while Opus 4.6 also scored 65.4%, making it a tie, not a Qwen victory.

The video then profiles the competition. Claude Opus 4.5 was the first model to break 80% on SWE-Bench Verified (80.9%) and leads on Aider Polyglot. Gemini 3 Pro topped the Web Dev Arena leaderboard at 1487 Elo and scored 76.2% on SWE-Bench Verified, with Gemini 3.1 Pro later jumping to 80.6%. DeepSeek V4, released just four days after Qwen 3.6 Max, scored 80.6% on SWE-Bench Verified, leads competitive programming with a Codeforces rating of 3206, and crucially is open weights under MIT license — a stark contrast to Qwen 3.6 Max, which is closed weights, breaking from Alibaba's previous Apache 2.0 practice.

Head-to-head comparisons show Opus 4.5 leading Qwen overall (80 vs. 72 on Bench LLM's leaderboard), with Qwen holding an edge only in knowledge tasks. On front-end work, Qwen claims an Elo of 1558 on its own Qwen Web Bench versus Opus 4.5's 1182 — but the presenter cautions this is Alibaba's proprietary benchmark. DeepSeek V4 Pro outperforms Qwen on Terminal Bench 2.0 (67.9% vs. 65.4%) and competitive programming. Qwen also generates output at only 33 tokens per second, well below the ~62 token median for models in its tier.
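The throughput gap translates directly into wait time, which a quick back-of-envelope calculation makes concrete. The response length below is an assumed figure for illustration; only the two tokens-per-second rates come from the summary.

```python
# Rough wait-time comparison at the quoted generation speeds:
# 33 tok/s for Qwen 3.6 Max vs. the ~62 tok/s tier median.
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Seconds to emit `tokens` at a steady generation rate."""
    return tokens / tokens_per_second

response_tokens = 2_000  # assumed response length for illustration

qwen_time = generation_seconds(response_tokens, 33)    # ~61 s
median_time = generation_seconds(response_tokens, 62)  # ~32 s
print(f"Qwen: {qwen_time:.0f}s, tier median: {median_time:.0f}s")
```

At roughly half the median throughput, a long agentic run accumulates this difference on every turn, which is why the presenter treats speed as a practical limitation rather than a cosmetic one.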

The presenter identifies legitimate use cases: agentic coding workflows (improved tool calling), scientific and engineering code, front-end generation, and a new 'preserved thinking' feature that maintains reasoning context across agent turns. Limitations include text-only input, preview status with no production SLA, verbosity on long tasks, and a noted tendency to hallucinate API details like function names and parameters.
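The hallucination warning lends itself to a mechanical check: before running model-generated code, verify that the names it references actually exist. A minimal stdlib-only sketch (the checked names are illustrative examples, not from the video):

```python
# Check whether a callable a model mentions actually exists, as a cheap
# first-pass guard against hallucinated API details.
import importlib

def api_exists(module_name: str, attr_name: str) -> bool:
    """Return True if `module_name.attr_name` imports and is callable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    attr = getattr(module, attr_name, None)
    return callable(attr)

# A real function vs. a plausible-sounding one a model might fabricate.
print(api_exists("json", "dumps"))      # True
print(api_exists("json", "serialize"))  # False: does not exist
```

A check like this catches fabricated names but not wrong parameters or semantics, so it complements rather than replaces the manual review the presenter recommends.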

The video concludes with six practical tips: test models on your own tasks rather than relying solely on benchmarks, use Opus 4.6/4.7 for production-grade reliability, use Gemini 3.1 Pro for large context needs, test Qwen independently for front-end work, match model complexity to task complexity, and avoid locking workflows to preview models. The overall verdict is that Qwen 3.6 Max is strong in specific areas but is not the all-around leader its marketing implies.

Key Insights

  • The presenter argues that Alibaba's benchmark comparisons are misleading because they used Claude Opus 4.5 as the baseline instead of the newer 4.6 or 4.7 — turning what appeared to be a clear Qwen win on Terminal Bench 2.0 into an actual tie with Opus 4.6, both scoring 65.4%.
  • DeepSeek V4, released just four days after Qwen 3.6 Max, matches or beats it on key benchmarks including Terminal Bench 2.0 (67.9% vs. 65.4%) and holds the highest Codeforces rating ever recorded at model release (3206), while also being open weights under MIT license — a major competitive differentiator.
  • Qwen 3.6 Max breaks from Alibaba's prior practice of releasing models under Apache 2.0, launching instead as closed weights — a move the presenter calls a 'big shift for Alibaba.'
  • Qwen 3.6 Max generates output at only 33 tokens per second, roughly half the ~62 token per second median for reasoning models in its tier, meaning users should expect noticeably slower response times compared to competing models.
  • The presenter warns that Qwen models have been noted by reviewers to hallucinate API details — fabricating function names and parameters that do not actually exist — making it essential to manually verify any code the model generates.

Topics

  • Qwen 3.6 Max benchmark performance and limitations
  • Misleading benchmark comparisons using outdated Claude versions
  • Head-to-head comparison with Claude Opus, Gemini 3, and DeepSeek V4
  • Open vs. closed weights model licensing
  • Practical use case recommendations for AI coding tools

Full transcript available for MurmurCast members
