GLM-5.1 (Fully Tested): THE BEST OPEN / AGENTIC MODEL IS HERE! This is CRAZY!
The speaker shares early-access test results for GLM 5.1, a post-training update of GLM 5 that is significantly better at agentic tasks but tends to reach for code in regular chat where none is needed. It ranks fifth on general benchmarks because of a regression in non-agentic conversations, yet takes second place on agentic leaderboards.
Summary
The video is an early-access review of GLM 5.1, a post-training update of GLM 5 that keeps the same parameters but focuses heavily on long-running and agentic tasks. The speaker notes that GLM 4.7 struggled with long-running tasks, whereas GLM 5 handles them well and GLM 5.1 handles them better still.
The update has a downside: the model now uses code unnecessarily in regular conversations, often creating HTML files or code blocks even for simple questions such as riddles, which makes it less pleasant for general chat. The speaker attributes this behavior to increased training on code data and reinforcement learning for coding tasks.
In agentic applications, GLM 5.1 is markedly improved. It follows instructions well, debugs capably, and stays focused on the objective without deviating. Unlike GLM 5, which would sometimes over-reason and slow down simple tasks, the new version is more efficient and snappier. It also plans better and understands context better.
In the benchmark tests, GLM 5.1 performs exceptionally well on coding tasks such as floor plans, SVG generation, 3D graphics, games, and various applications, but struggles with general math and chat questions. It ranks second on agentic leaderboards and compares favorably to Opus 4.6 and Codex, and the speaker is considering switching to it given its performance relative to its low cost.
Key Insights
- GLM 5.1 has been trained more heavily on code, which causes it to create HTML files and code blocks unnecessarily, even for simple questions such as riddles, making regular chat less pleasant
- Unlike GLM 5, which would reason excessively and slow down simple tasks, GLM 5.1 has been optimized not to over-reason, making it feel much snappier
- GLM 5.1 ranks second on agentic leaderboards despite being an open model, performing comparably to Opus 4.6 and better than Codex while being significantly cheaper