While Google charges developers per-minute fees to transcribe audio and track speakers through Gemini's API, Microsoft dropped a frontier-grade alternative into open source—for free, under the MIT license, with no usage quotas or API keys required.
VibeVoice, released in January 2026 but only now gaining traction, combines speech-to-text and speaker diarization in a single 17.3GB model. Speaker diarization—the ability to distinguish "Speaker 1" from "Speaker 2" in a meeting recording—is a feature Google bundles behind its paywall. Microsoft shipped it as a baseline capability in an open-weight model anyone can download, fine-tune, or deploy locally.
Simon Willison ran benchmarks on a 128GB M5 Max MacBook Pro using the 4-bit MLX conversion (5.71GB). Processing a one-hour podcast took 8 minutes 45 seconds. The model produced output at 38.5 tokens per second during the generation phase, with a reported peak memory usage of 30.44GB, although Willison observed spikes to 61.5GB during the prefill stage.
The practical command is a single line:
```
uv run --with mlx-audio mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-4bit \
  --audio input.mp3 --output-path output \
  --format json --max-tokens 32768
```
The `--max-tokens` flag matters. The default is 8192, which caps output at roughly 25 minutes of audio. Raise it to 32768 to handle full-length recordings without truncation.
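To size `--max-tokens` for a given recording, that ratio can be turned into a rough estimate. The sketch below assumes output tokens scale linearly with audio length, at the figure of 8192 tokens per roughly 25 minutes quoted here. The function name `estimate_max_tokens` and the 20% headroom are illustrative choices, not part of mlx-audio.

```python
import math

# Ratio taken from the text: 8192 tokens covers roughly 25 minutes of audio.
TOKENS_PER_MINUTE = 8192 / 25  # about 328 tokens per minute (approximate)


def estimate_max_tokens(audio_minutes: float, headroom: float = 0.2) -> int:
    """Rough token budget for a recording, assuming linear scaling.

    `headroom` is a hypothetical safety margin for dense speech.
    The result is rounded up to the next multiple of 1024.
    """
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    raw = audio_minutes * TOKENS_PER_MINUTE * (1 + headroom)
    return max(1024, math.ceil(raw / 1024) * 1024)


if __name__ == "__main__":
    for minutes in (25, 60, 90):
        print(minutes, "min ->", estimate_max_tokens(minutes))
```

Under these assumptions a one-hour recording lands at 24576 tokens, which fits inside the 32768 budget, while recordings much longer than 80 minutes may need a larger value.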
The output structure includes speaker-labeled segments natively—no post-processing pipeline required. For developers building meeting summarizers, podcast tools, or call analytics, this eliminates a dependency on cloud APIs. For privacy-conscious teams, running inference locally means audio never leaves the machine.
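As an illustration of how little glue code a speaker-labeled transcript needs, here is a minimal sketch that merges consecutive segments from the same speaker into readable turns. The JSON layout is an assumption: the field names `speaker_id`, `start`, `end`, and `text`, and the sample data itself, are hypothetical and should be checked against the file the tool actually writes.

```python
import json
from itertools import groupby

# Hypothetical sample; the real output file may use different field names.
SAMPLE = """
[
  {"speaker_id": 0, "start": 0.0, "end": 4.2, "text": "Welcome back to the show."},
  {"speaker_id": 0, "start": 4.2, "end": 7.9, "text": "Today we have a guest."},
  {"speaker_id": 1, "start": 7.9, "end": 10.5, "text": "Thanks for having me."}
]
"""


def merge_turns(segments, speaker_key="speaker_id", text_key="text"):
    """Collapse consecutive segments from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(segments, key=lambda s: s[speaker_key]):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0]["start"],  # assumed field: segment start in seconds
            "end": group[-1]["end"],     # assumed field: segment end in seconds
            "text": " ".join(s[text_key].strip() for s in group),
        })
    return turns


def format_turn(turn):
    """Render a turn as '[mm:ss] Speaker N: text'."""
    minutes, seconds = divmod(int(turn["start"]), 60)
    return f"[{minutes:02d}:{seconds:02d}] Speaker {turn['speaker']}: {turn['text']}"


if __name__ == "__main__":
    for turn in merge_turns(json.loads(SAMPLE)):
        print(format_turn(turn))
```

The key names are passed as parameters so the same function can be pointed at whatever schema the real output uses.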
Microsoft's open-source AI portfolio has expanded aggressively: Phi for language, Magma for multimodal, and now VibeVoice for audio. The pattern is clear. Identify capabilities Google monetizes through Gemini, replicate them under permissive licenses, and let the developer community debug and optimize them collectively.
GitHub shows 302 stars and 166 comments since the push gained attention. That's modest compared to mainstream repos, but the model is specialized. Among voice AI practitioners, the response is quieter and more technical: this works, it runs on consumer hardware, and the MIT license means commercial deployment without friction.
The benchmark that matters isn't tokens per second—it's cost per transcript. At 38.5 tokens/sec with local inference, the marginal cost of transcribing 1,000 hours of audio is electricity, not API credits. For startups building transcription businesses, that's a structural advantage Google can't easily counter without matching the openness.
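The arithmetic behind that claim can be sketched from the benchmark figures. The example assumes processing time scales linearly with audio length at Willison's measured 8 minutes 45 seconds per hour of audio on a single machine. The power draw and electricity price are placeholder values chosen only for illustration, not measurements.

```python
# Benchmark figure from the text: one hour of audio in 8 min 45 s.
SECONDS_PER_AUDIO_HOUR = 8 * 60 + 45  # 525 seconds


def compute_hours(audio_hours: float) -> float:
    """Wall-clock hours of inference, assuming linear scaling."""
    return audio_hours * SECONDS_PER_AUDIO_HOUR / 3600


def electricity_cost(audio_hours: float, watts: float, price_per_kwh: float) -> float:
    """Energy cost in the currency of price_per_kwh.

    `watts` and `price_per_kwh` are caller-supplied assumptions.
    """
    kwh = compute_hours(audio_hours) * watts / 1000
    return kwh * price_per_kwh


if __name__ == "__main__":
    hours = compute_hours(1000)
    print(f"1,000 audio hours -> {hours:.1f} compute hours")
    # Placeholder inputs: 100 W average draw, 0.15 per kWh.
    print(f"Estimated energy cost: {electricity_cost(1000, 100, 0.15):.2f}")
```

At the measured rate, 1,000 hours of audio works out to roughly 146 hours of machine time, or about six days of continuous processing on one laptop.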
VibeVoice is available at `huggingface.co/microsoft/VibeVoice-ASR`. The MLX-optimized 4-bit variant runs on Apple Silicon without additional setup. For everyone else, the full 17.3GB model works on standard CUDA hardware.