New Nvidia Nemotron 3 Nano Omni Update Changes Everything!
Nvidia released Nemotron 3 Nano Omni on April 28th, 2026, a free 30-billion-parameter multimodal AI model that can simultaneously process text, images, audio, and video. It runs 9.2 times faster than competing models on video tasks and outperforms previous open Omni models across all major benchmarks. The video covers its technical architecture, benchmark results, and practical business applications.
Summary
The video introduces Nvidia's Nemotron 3 Nano Omni, released April 28th, 2026, as a major leap in open-source multimodal AI. Unlike most AI models that specialize in a single modality, this model can process text, images, audio (up to 1 hour), and video (up to 2 minutes) simultaneously in a single pass, with a 256K context window for handling large documents. The presenter frames this as transformative for small business owners who are overwhelmed by PDFs, voice notes, screen recordings, and training videos.
The technical architecture is explained in accessible terms. The model uses 30 billion parameters but only activates roughly 3 billion at a time through a Mixture of Experts (MoE) design, where specialized sub-models are selectively engaged depending on the query. For video processing, Nvidia introduced Conv3D Tubelet Embedding, which processes two video frames simultaneously instead of one, and Efficient Video Sampling, which skips redundant frames where little is happening and focuses attention on frames with meaningful activity.
Benchmark results are presented across multiple evaluation categories: OCR Bench V2 (65.8% vs. 61.2% for the prior model), Video MME (72.2% vs. 70.5%), Voice Bench (89.4% vs. 88.8%), M Long Bench Doc for document analysis (57.5%), and Screen Spot Pro for on-screen UI understanding (57.8%). Nvidia claims 9.2x efficiency on video tasks and 7.4x on multi-document tasks, meaning a task that previously took 9 minutes now takes roughly 1 minute.
The presenter outlines deployment options: DeepInfra for API-based access with OpenAI-compatible endpoints, and Hugging Face for local deployment via Unsloth, with multiple quantization options (BF16, FP4, NVFP4) for varying hardware capabilities. A practical use case is illustrated with a real estate agent using the model to auto-generate property descriptions and identify issues from 50 walkthrough videos. The model's Screen Spot Pro score is highlighted as enabling agentic screen interaction — bots that can read a screen and autonomously click, fill forms, or gather data.
The video closes with a broader observation that open multimodal AI has advanced dramatically in one year, and promotes two communities: the paid AI Profit Boardroom and the free AI Success Lab with 67,000 members.
Key Insights
- The presenter claims Nemotron 3 Nano Omni uses a Mixture of Experts architecture with 30 billion total parameters but only activates approximately 3 billion at a time, which is the primary reason it achieves 9x faster inference than comparable multimodal models.
- Nvidia introduced Conv3D Tubelet Embedding and Efficient Video Sampling to handle video — processing two frames at once and skipping static frames — allowing the model to analyze a 2-minute video without the computational cost that normally makes video processing prohibitively slow.
- The presenter argues that Nemotron 3 Nano Omni's 9.2x efficiency on video tasks and 7.4x efficiency on multi-document tasks means a real-world task that previously took 9 minutes now takes approximately 1 minute, making it viable for agents processing thousands of documents or hours of recordings daily.
- The presenter highlights the model's Screen Spot Pro score of 57.8% as evidence that it can understand on-screen UI elements and perform autonomous computer interactions — clicking, form-filling, and data gathering — describing this as a capability that 'used to be science fiction' and is now available as a free download.
- The presenter frames one year of open multimodal AI progress as equivalent to five years of prior advancement, noting that last year's best open Omni models could barely handle a single image with a paragraph of text, while Nemotron 3 Nano Omni now watches video, processes hour-long audio, and reads massive documents simultaneously.
Topics
Full transcript available for MurmurCast members
Sign Up to Access