Model Release · Synthesized from 3 sources

NVIDIA's 30B Nano Omni Tops 6 Leaderboards at 9x Efficiency

Key Points

• 30B-A3B MoE architecture delivers 9x the throughput of comparable open omni models (NVIDIA internal benchmarks)
• Single-pass processing of video, audio, images, and text
• Tops six multimodal and document intelligence leaderboards
• CRADIO v4-H vision + Parakeet audio + Mamba2 LLM unified
• 131K token context for full-hour video plus transcripts

References (3)

[1] NVIDIA Nemotron 3 Nano Omni unifies vision, audio, language. NVIDIA AI Blog.
[2] NVIDIA Releases Nemotron 3 Nano Omni Model. Hugging Face Blog.
[3] NVIDIA Nemotron 3 Nano Omni hits SageMaker JumpStart at 30B. AWS Machine Learning Blog.

What if an AI agent could watch your screen, listen to a customer call, and read a PDF, all in the same thought? Until now, that has been impractical: most agentic systems stitch together separate models for vision, speech, and language, passing data back and forth in a relay race that fragments context and burns compute. NVIDIA's new Nemotron 3 Nano Omni is designed to end that compromise.

Unveiled April 28th, this open multimodal model processes video, audio, images, and text in a single inference pass: no routing between models, no context switching, no accumulated latency. Built on a 30B-A3B hybrid MoE architecture, it activates only about 3 billion of its 30 billion total parameters for each token it processes. The result: 9x higher throughput than comparable open omni models, according to NVIDIA's internal benchmarks.
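To make the "30B total, 3B active" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind sparse MoE layers. It is a toy under stated assumptions, not NVIDIA's implementation: the expert count, the top-k value, the router scores, and all function names are hypothetical, and the sources do not describe the model's actual router.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and mix their outputs by router weight."""
    selected = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in selected)
    return output, selected


# Hypothetical toy layer: 8 tiny "experts", 2 active per token.
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, scale=scale: x * scale for scale in range(1, NUM_EXPERTS + 1)]

# Stand-in router scores; a real router computes these from the token itself.
rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
output, selected = moe_forward(2.0, experts, router_scores, TOP_K)

print("experts run:", [i for i, _ in selected], "of", NUM_EXPERTS)

# The headline ratio from the article: roughly 3B active out of 30B total.
active_fraction = 3e9 / 30e9
print(f"active parameter fraction: {active_fraction:.0%}")
```

The point of the sketch is that compute per token scales with the experts that actually run, not with the total parameter count, which is why a sparse model can hold far more parameters than it pays for on each step.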

The architecture combines three specialized components in one model: two encoders and a language backbone. CRADIO v4-H handles vision: screens, documents, charts, and video frames. Parakeet processes audio: call recordings, voice notes, ambient sound. A Mamba2-Transformer backbone reasons across all of it and generates text output. With a 131K-token context, the model can hold an hour of video plus transcripts in a single reasoning window.
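A quick back-of-the-envelope calculation shows what fitting an hour of video into that window implies. This is an illustrative sketch only: it assumes "131K" means 131,072 tokens, and the frame sampling rates, tokens per frame, and transcript rate are hypothetical values chosen for the arithmetic, not figures from NVIDIA.

```python
# Assumption: "131K" is read as 2**17 = 131,072 tokens.
CONTEXT_TOKENS = 131_072
ONE_HOUR_S = 3600


def context_budget(duration_s, frames_per_s, tokens_per_frame,
                   transcript_tokens_per_s, reserve_tokens=0):
    """Estimate token use for a clip and report whether it fits the window."""
    video = duration_s * frames_per_s * tokens_per_frame
    transcript = duration_s * transcript_tokens_per_s
    total = video + transcript + reserve_tokens
    return {
        "video": video,
        "transcript": transcript,
        "total": total,
        "fits": total <= CONTEXT_TOKENS,
    }


# Average budget available per second of footage across the whole window.
tokens_per_second = CONTEXT_TOKENS / ONE_HOUR_S
print(f"budget: {tokens_per_second:.1f} tokens per second of footage")

# Hypothetical rates: one frame every 2 s at 64 tokens each, 3 transcript tokens/s.
sparse = context_budget(ONE_HOUR_S, frames_per_s=0.5, tokens_per_frame=64,
                        transcript_tokens_per_s=3)

# Same clip sampled at one frame per second.
dense = context_budget(ONE_HOUR_S, frames_per_s=1.0, tokens_per_frame=64,
                       transcript_tokens_per_s=3)

print("sparse sampling fits:", sparse["fits"], "total:", sparse["total"])
print("dense sampling fits:", dense["fits"], "total:", dense["total"])
```

Under these assumed rates, the window leaves only a few dozen tokens per second of footage, so an hour-long clip fits only with sparse frame sampling or compact per-frame representations.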

H Company, a French AI company that builds agents, is already deploying it in production. "To build useful agents, you can't wait seconds for a model to interpret a screen," said CEO Gautier Cloix. "By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings, something that wasn't practical before."

The model tops six leaderboards for document intelligence and multimodal understanding. Enterprise adopters include Foxconn, Palantir, and Docusign, while Dell, Oracle, and Infosys are evaluating it. It runs in FP8 precision on Amazon SageMaker JumpStart, with day-zero availability via Hugging Face and 25+ platforms.

At its core, Nemotron 3 Nano Omni collapses what used to be an orchestration nightmare into a single model call. Developers no longer need to synchronize vision, audio, and language models — or pay for three separate inference runs. The price of fragmentation was latency; the cost of unification was accuracy. This model suggests that trade-off is finally obsolete.