DeepSeek Just Made ALL LLMs Faster
DeepSeek's new Spark technique uses semi-autoregressive speculative decoding to accelerate LLM inference by 51-400% without quality loss or model retraining. By combining a fast parallel draft model with an efficient verification process, Spark achieves higher token acceptance rates than competing methods like Eagle 3 and Flash, enabling faster inference on consumer hardware.
Summary
The video explains speculative decoding, a technique that pairs a large target model with a much smaller draft model to speed up token generation. The draft model quickly generates candidate tokens while the large model verifies their correctness, avoiding the computational bottleneck of sequential autoregressive generation. DeepSeek's innovation, Spark, improves upon earlier speculative decoding approaches (Eagle 3 and Flash) by addressing three key problems: draft model speed, token acceptance rate, and verification efficiency.
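The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of generic greedy speculative decoding, not DeepSeek's implementation: the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical, tokens are plain integers, and the single batched verification pass of a real target model is emulated here by one call per position.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[Token]:
    """Propose k draft tokens, then keep the prefix the target agrees with."""
    # 1. The cheap draft model proposes k candidate tokens.
    ctx = list(prefix)
    draft: List[Token] = []
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target checks each position. A real system scores all k
    #    positions in one forward pass; the loop below only emulates that.
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, max_new_tokens: int, k: int = 4) -> List[Token]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new_tokens]


def autoregressive(prefix: List[Token], target_next: NextToken,
                   max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "large" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "small" model: agrees with the target except after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


# Three draft tokens are accepted, then the target's correction: [1, 2, 3, 4]
print(speculative_step([0], toy_draft, toy_target))
print(generate([0], toy_draft, toy_target, 12))
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain autoregressive decoding with the target alone. This is why the approach can speed up generation without changing quality.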
Spark uses a semi-autoregressive architecture where a parallel block rapidly generates four draft tokens, followed by a sequential block that injects dependencies between tokens to maintain logical coherence. Each token receives a confidence score between 0 and 1; tokens below a threshold are discarded before reaching the target model. The remaining candidates are passed to the target model for final verification. This approach achieves substantially higher token acceptance rates (around 61%) compared to alternatives.
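The confidence-based pre-filtering step can be sketched as follows. This is a hedged illustration only: the summary does not say how Spark computes or applies the scores, so the function name `filter_draft_prefix`, the default threshold of 0.5, and the choice to cut the draft at the first low-confidence token (on the assumption that later tokens depend on earlier ones) are all assumptions.

```python
from typing import List, Tuple


def filter_draft_prefix(draft: List[Tuple[str, float]],
                        threshold: float = 0.5) -> List[str]:
    """Keep the leading draft tokens whose confidence is at least `threshold`.

    `draft` is a list of (token, confidence) pairs with confidence in [0, 1].
    Truncating at the first low-confidence token is an assumption made for
    this sketch, not a documented detail of Spark.
    """
    kept: List[str] = []
    for token, confidence in draft:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        if confidence < threshold:
            break  # everything after this point is dropped as well
        kept.append(token)
    return kept


# Hypothetical draft of four tokens with made-up confidence scores.
draft = [("the", 0.9), ("cat", 0.8), ("sat", 0.3), ("on", 0.95)]
print(filter_draft_prefix(draft))  # ['the', 'cat']
```

Only the surviving tokens would be handed to the target model, so it spends no verification work on candidates the draft model itself considers unlikely.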
Benchmark results show Spark delivers 52-57% throughput improvements on V4 Flash and 406-661% on V4 Pro, translating to nearly double the tokens per second. The speaker emphasizes that combining Spark with other optimization techniques (dynamic quantization, SSD streaming, inference engines) could enable high-quality LLMs to run effectively on consumer PCs at speeds comparable to cloud-based services, democratizing access to powerful models.
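Percentage improvements and speedup multiples are easy to confuse: a gain of p% means the new throughput is (1 + p/100) times the baseline. The short sketch below converts the percentages quoted above into multiples. The function name `speedup_multiple` and the 20 tokens-per-second baseline are hypothetical, chosen only for illustration.

```python
def speedup_multiple(improvement_percent: float) -> float:
    """Convert a throughput improvement in percent into a speedup multiple."""
    return 1.0 + improvement_percent / 100.0


baseline_tps = 20.0  # hypothetical baseline, tokens per second

for percent in (52, 57, 406, 661):
    multiple = speedup_multiple(percent)
    print(f"{percent}% -> {multiple:.2f}x, "
          f"{baseline_tps * multiple:.1f} tokens/s from a {baseline_tps:.0f} tokens/s baseline")
```

Under this reading, a 52-57% improvement is roughly a 1.5x to 1.6x speedup, while 406-661% corresponds to roughly 5x to 7.6x.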
Key Insights
- Speculative decoding works by having a small draft model quickly propose several tokens, which a large target model then verifies in a single pass. The video's analogy is someone speaking for you in a meeting while you listen and correct errors: checking is lighter work than producing every word yourself, just as verification is cheaper than sequential autoregressive generation
- Spark achieves better token acceptance rates than Eagle 3 and Flash by using a semi-autoregressive architecture combining parallel draft generation with a sequential block that injects dependencies between tokens, maintaining logical coherence
- The hardware-aware prefix scheduler in Spark pre-filters draft tokens based on confidence scores, eliminating low-confidence candidates before presenting them to the target model, reducing unnecessary verification work
- Spark achieves throughput improvements of 52-57% on V4 Flash and 406-661% on V4 Pro, both production DeepSeek models, without requiring model retraining or weight modification, which suggests the technique applies broadly
- Combining Spark with other optimization techniques like dynamic quantization and SSD streaming could enable consumer PCs to run advanced LLMs at 50-60 tokens per second, making cloud-dependent models like GPT accessible locally
Transcript
[0:00] Every time DeepSeek releases a new scientific paper, it is a masterpiece of engineering and optimization. For me, making these videos is pure enjoyment: I have a lot of fun explaining the insights and what's behind them in a simple way. So, in this video, we're going to see what Spark is, this new semi-autoregressive speculative decoding technique that leads to efficiency gains of 51% [0:31] to 400% in certain cases in output generation speed, without losing quality and without retraining, and this is absurd. It is not a quantization technique and it does not touch the weights of the model; it is a technique that keeps the model as it is, it uses speculative decoding and in certain…