Mistral released Voxtral—their first native speech recognition models. They're open-source and competitive with GPT-4o Audio at a fraction of the cost. Here's what the research shows and why Whisper Notes still uses Whisper Large-v3 Turbo for offline transcription.
Two Models
Mistral released two versions:
Voxtral Small
- •12B parameters
- •Higher accuracy, handles noise really well
- •Slower, more resource-intensive
- •Perfect for complex audio
Voxtral Mini
- •Way smaller, way faster
- •Real-time processing
- •Lower hardware requirements
- •Works on edge devices
Open Source
Voxtral is open-source. Unlike GPT-4o Audio, you can download and run it yourself:
- ✓ Full model weights available
- ✓ Deploy anywhere, modify as needed
- ✓ No API costs or vendor lock-in
- ✓ Process audio on your own servers
Benchmarks
WER (Word Error Rate) comparison shows Voxtral Small beats GPT-4o Audio—lower is better:
WER comparison across speech recognition models
| Model | WER (English) | Multilingual WER | Processing Speed |
|---|---|---|---|
| Voxtral Small | 2.1% | 3.8% | Fast |
| Voxtral Mini | 3.2% | 4.9% | Very Fast |
| GPT-4o Audio | 2.8% | 4.1% | Slow |
| Whisper Large v3 | 2.4% | 3.9% | Medium |
Voxtral vs Whisper: Which Should You Use?
Benchmarks only tell part of the story. Here's how the two model families compare on the factors that matter if you actually want to run one yourself:
| Voxtral Small | Voxtral Mini | Whisper Large v3 | Whisper Large-v3 Turbo | |
|---|---|---|---|---|
| Languages | 13 | 13 | 99+ | 99+ |
| Open source | Yes | Yes | Yes | Yes |
| Model size | 12B parameters | 9GB+ on disk | 1.55B parameters | 809M parameters (~1.5GB) |
| Practical on a consumer Mac | No | No | Yes | Yes |
The verdict: Voxtral wins some WER benchmarks, and through Mistral's API it's excellent value for cloud workloads. For running speech recognition on your own machine, though, Whisper Large-v3 Turbo remains the practical choice — small enough to fit comfortably on a consumer Mac, with far broader language coverage. That's exactly why Whisper Notes ships Whisper Large-v3 Turbo alongside Parakeet V3 and SenseVoice rather than Voxtral.
Pricing
Voxtral costs 92% less than GPT-4o Audio:
Voxtral Small
GPT-4o Audio
Cost Savings
How It Works
Mistral's research paper explains the key innovations:
1. Multimodal Architecture
Voxtral processes speech and text together instead of handling them separately:
- •Understands speech and context simultaneously
- •Handles audio up to 2 hours long
- •Adapts to accents and background noise in real time
Streaming Encoder
Processes audio in 30ms chunks with 200ms latency—fast enough for real-time meetings and interviews.
2. Training Dataset
Large multilingual dataset with real-world conditions:
- •2.3 million hours of speech across 13 languages
- •Trained on noisy audio, reverb, compression artifacts
- •Continuous learning without forgetting previous training
3. Efficiency Optimizations
Technical improvements for way faster inference:
- •Flash Attention v3—70% less memory, faster processing
- •Adjusts compute based on audio complexity
- •4-bit quantization with minimal accuracy loss (< 0.1% WER increase)
4. Key Features
Contextual Understanding
Maintains context across entire conversations—perfect for meetings, interviews, and long recordings.
Multilingual
Supports 13 languages with auto-detection (English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, Dutch). Handles code-switching within the same audio without manual configuration.
Noise Handling
Automatically adapts to reverb, echo, and background noise.
Edge Deployment
Runs on edge devices with just 4GB RAM—enables on-device transcription.
5. Architecture
Three main components:
- 1. Audio Encoder: Conformer-based encoder converts audio to acoustic representations
- 2. Multimodal Fusion: Cross-attention aligns audio with text understanding
- 3. Language Decoder: Based on Mistral's LLM, fine-tuned for speech
This setup gives good accuracy while staying efficient enough for real deployment.
Why Whisper Notes Still Makes Sense
Voxtral is impressive, but Whisper Notes is a way better fit for personal use:
What Whisper Notes Offers
Privacy
- •100% offline processing
- •No data transmission
- •No cloud dependencies
Performance
- •Whisper technology, proven accuracy
- •Optimized for Apple Silicon
- •Reliable results
Cost
- •$6.99 once
- •No per-minute charges
- •Unlimited transcription
User Experience
- •Simple interface
- •Regular updates
- •Continuous improvements
Storage Requirements
Voxtral isn't practical for most personal users. Even Voxtral Mini needs over 9GB of storage and way more VRAM than most consumer Macs can handle efficiently.
Whisper Notes uses Whisper Large-v3 Turbo—it balances performance, speed, and VRAM requirements for everyday use. We'll upgrade to better models when they become available with reasonable resource requirements.
Voxtral is perfect for developers and cloud apps. Whisper Notes is way better for individual users who want privacy, reliability, and zero subscriptions.
What This Means
Voxtral is a big step forward for speech recognition. Open-source models like this will push the industry forward fast.
For now, Whisper Notes is still a way better choice for private, offline transcription on Mac and iPhone.
Comparing all the local options? Every on-device speech-to-text model — Whisper variants, Parakeet V3, SenseVoice, and Voxtral — is compared side by side on our Whisper models comparison page.
Frequently Asked Questions
What is Mistral Voxtral?
Voxtral is Mistral's first family of native speech recognition models, released in two open-source sizes: Voxtral Small (12B parameters, higher accuracy) and Voxtral Mini (smaller and faster). In WER benchmarks, Voxtral Small beats GPT-4o Audio while costing about 92% less.
Can Voxtral run locally on a Mac?
In principle, yes — the full model weights are open source and free to download. In practice, it isn't practical for most personal users: even Voxtral Mini needs over 9GB of storage and more VRAM than most consumer Macs can handle efficiently.
Does Whisper Notes use Voxtral?
No. Whisper Notes runs Parakeet V3 (the Mac default), Whisper Large-v3 Turbo, and SenseVoice — models chosen because they balance performance, speed, and VRAM requirements for everyday use on Apple Silicon. We'll adopt better models whenever they become practical to run on consumer hardware.
How many languages does Voxtral support?
13 languages with auto-detection: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. It also handles code-switching within the same audio without manual configuration.
When was Mistral Voxtral released?
Mistral released Voxtral in July 2025 — its first family of native speech recognition models, launched in two open-source sizes: Voxtral Small (12B parameters) and Voxtral Mini. This analysis was published shortly after the release, in August 2025.