NEW Ollama 0.19 Update is INSANE!
Ollama 0.19 introduces a massive speed improvement for local AI on Apple silicon by integrating with Apple's MLX framework, achieving nearly 2x faster response generation and 1.6x faster input processing. The update also includes smarter caching across conversations and support for Nvidia's NVFP4 format, making local AI competitive with cloud services for the first time.
Summary
Ollama 0.19 represents a major breakthrough in local AI performance, specifically for Apple silicon devices. The update integrates with Apple's MLX machine learning framework, which takes advantage of the unified memory architecture where the CPU and GPU share the same memory pool with no transfer overhead. Benchmark results using Alibaba's Qwen 3.5 35B model show prefill speed increasing 1.6x to 1,110 tokens per second and decode speed nearly doubling from 58 to 112 tokens per second. With INT4 quantization, decode speeds can reach up to 134 tokens per second. The update also introduces intelligent caching that preserves context across conversations, eliminating the need to reprocess project files and instructions from scratch each session; this particularly benefits coding agents and daily assistant tools. Additionally, Ollama 0.19 supports Nvidia's NVFP4 format for model compression, allowing larger models to run on the same hardware while maintaining accuracy. The update requires a Mac with Apple silicon and more than 32GB of unified memory. This represents a fundamental shift in the local-vs.-cloud AI trade-off, making local AI genuinely fast rather than just a privacy-focused compromise.
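To make the compression claim concrete: the core idea behind 4-bit formats like NVFP4 is storing each weight as a tiny code plus a shared per-block scale factor. The sketch below is a simplified illustration only, not the actual NVFP4 format (which uses 4-bit floating-point values with dedicated block scales); it uses signed 4-bit integer codes to show why a block of floats can shrink roughly 4x versus 16-bit storage while staying close to the original values.

```python
# Toy block quantization sketch (illustrative, NOT the real NVFP4 format):
# each block of weights is stored as small signed 4-bit codes plus one scale.

def quantize_block(values, levels=7):
    """Map floats to signed 4-bit codes (-7..7) with a shared scale."""
    scale = max(abs(v) for v in values) / levels or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from codes and the block scale."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.09, 0.5]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"codes: {codes}")
print(f"max reconstruction error: {max_err:.3f}")
```

The worst-case error per weight is half a quantization step (scale / 2), which is why 4-bit formats can preserve accuracy well when blocks are small and scales are chosen per block.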
Key Insights
- The creator states that Apple silicon chips use unified memory where CPU and GPU share one memory pool with no copying or transfer overhead, unlike traditional computers where CPU and GPU have separate memory pools
- Ollama's own testing shows that version 0.19 with MLX achieves 1,110 tokens per second on prefill (1.6x increase) and 112 tokens per second on decode (nearly double) compared to version 0.18
- The speaker explains that Ollama 0.19 can now reuse cache across conversations by storing intelligent checkpoints, so when branching into new conversations the model picks up from where it left off instead of reprocessing everything
- The creator argues that local AI has had a perception problem where people assumed cloud was for performance and local was only for privacy purists or tinkerers, but Ollama 0.19 is shifting that narrative
- The speaker claims that Apple's MLX framework has been shown in independent research to achieve some of the highest throughput numbers for AI inference on Apple silicon, outperforming older backends by 20 to 30% in sustained generation speed
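The checkpoint idea described above can be sketched as a prefix cache: if a new conversation starts with tokens the model has already processed (system prompt, project files, instructions), only the unseen suffix needs fresh prefill. All names below are hypothetical illustrations of the concept, not Ollama's actual API.

```python
# Minimal prefix-cache sketch of cross-conversation checkpoint reuse.
# A real implementation would cache KV-state, not just token prefixes.

class PrefixCache:
    def __init__(self):
        self._checkpoints = []  # token sequences already processed

    def save(self, tokens):
        self._checkpoints.append(list(tokens))

    def reusable_prefix_len(self, tokens):
        """Length of the longest cached prefix matching `tokens`."""
        best = 0
        for checkpoint in self._checkpoints:
            matched = 0
            for cached_tok, new_tok in zip(checkpoint, tokens):
                if cached_tok != new_tok:
                    break
                matched += 1
            best = max(best, matched)
        return best

cache = PrefixCache()
session1 = ["<system>", "<project-files>", "<instructions>", "question-1"]
cache.save(session1)

# A branched conversation shares everything but its final question,
# so only 1 of 4 tokens needs reprocessing instead of all 4.
session2 = ["<system>", "<project-files>", "<instructions>", "question-2"]
reused = cache.reusable_prefix_len(session2)
print(f"reused {reused} of {len(session2)} tokens")  # reused 3 of 4 tokens
```

This is why the feature matters most for coding agents and daily assistants: their prompts share long, stable prefixes across sessions, so most of the prefill cost is paid once.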