Mistral has released Voxtral, its first family of native speech recognition models. The models are open-source and competitive with GPT-4o Audio at a fraction of the cost. Here's what the research shows and why Whisper Notes still uses Whisper Large-v3 Turbo for offline transcription.
Two Models
Mistral released two versions:
Voxtral Small
- 24B parameters
- Higher accuracy, handles noisy audio well
- Slower and more resource-intensive
- Best suited to complex audio
Voxtral Mini
- Much smaller and faster
- Real-time processing
- Lower hardware requirements
- Works on edge devices
Open Source
Voxtral is open-source. Unlike GPT-4o Audio, you can download and run it yourself:
- ✓ Full model weights available
- ✓ Deploy anywhere, modify as needed
- ✓ No API costs or vendor lock-in
- ✓ Process audio on your own servers
Benchmarks
Word error rate (WER) is the share of words a model gets wrong, so lower is better. In this comparison, Voxtral Small beats GPT-4o Audio:
WER comparison across speech recognition models
| Model | WER (English) | Multilingual WER | Processing Speed |
|---|---|---|---|
| Voxtral Small | 2.1% | 3.8% | Fast |
| Voxtral Mini | 3.2% | 4.9% | Very Fast |
| GPT-4o Audio | 2.8% | 4.1% | Slow |
| Whisper Large-v3 | 2.4% | 3.9% | Medium |
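If you want to check numbers like these on your own recordings, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. Here's a minimal Python sketch of that standard definition. It is not the evaluation code behind the table above, and the function name is ours. Real benchmarks also normalize punctuation and number formats first; this sketch only lowercases and splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A WER of 2.1% means roughly two errors per hundred reference words under this definition.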
Pricing
According to the comparison, Voxtral Small costs 92% less than GPT-4o Audio.
How It Works
Mistral's research paper explains the key innovations:
1. Multimodal Architecture
Voxtral processes speech and text together instead of handling them separately:
- Understands speech and context simultaneously
- Handles audio up to 2 hours long
- Adapts to accents and background noise in real time
Streaming Encoder
The streaming encoder processes audio in 30 ms chunks with 200 ms latency, fast enough for real-time meetings and interviews.
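To get a feel for what a 30 ms chunk means in practice, here is a small Python sketch that slices a buffer of audio samples into fixed-size chunks. It is an illustration only, not Voxtral's streaming code. The 16 kHz sample rate is our assumption (a common rate for speech models), and the function name is hypothetical.

```python
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: common speech sample rate, not stated in the article
CHUNK_MS = 30            # chunk duration quoted above


def iter_chunks(
    samples: Sequence[float],
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    chunk_ms: int = CHUNK_MS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of chunk_ms milliseconds; the last may be shorter."""
    chunk_len = sample_rate_hz * chunk_ms // 1000  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("chunk duration is too short for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence
chunks = list(iter_chunks(one_second))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Under these assumptions each chunk holds 480 samples, so a streaming model sees a little over 33 chunks per second of audio.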
2. Training Dataset
Voxtral was trained on a large multilingual dataset that reflects real-world conditions:
- 2.3 million hours of speech across 108 languages
- Trained on noisy audio, reverb, and compression artifacts
- Continuous learning without forgetting previous training
3. Efficiency Optimizations
Technical improvements for faster inference:
- Flash Attention v3: 70% less memory and faster processing
- Adjusts compute based on audio complexity
- 4-bit quantization with minimal accuracy loss (< 0.1% WER increase)
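To make the 4-bit idea concrete, here is a minimal Python sketch of symmetric quantization: every weight is mapped to one of 16 integer codes plus a shared scale factor. The article doesn't say which quantization scheme Voxtral uses, so treat this as a generic illustration with hypothetical function names. Production schemes typically use one scale per small group of weights rather than one per tensor.

```python
from typing import List, Sequence, Tuple


def quantize_4bit(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit codes in [-8, 7] using a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest weight maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes: Sequence[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [code * scale for code in codes]


weights = [3.5, -1.5, 0.0, 0.5, 1.2]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
print(codes, scale, restored)
```

Each restored value is within half a scale step of the original, which is the rounding error the accuracy figure above has to absorb.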
4. Key Features
Contextual Understanding
Maintains context across entire conversations, which suits meetings, interviews, and long recordings.
Multilingual
Supports 108 languages with automatic detection and handles code-switching within the same recording.
Noise Handling
Automatically adapts to reverb, echo, and background noise.
Edge Deployment
Runs on edge devices with as little as 4 GB of RAM, which enables on-device transcription.
5. Architecture
Three main components:
1. Audio Encoder: Conformer-based encoder converts audio to acoustic representations
2. Multimodal Fusion: Cross-attention aligns audio with text understanding
3. Language Decoder: Based on Mistral's LLM, fine-tuned for speech
This setup gives good accuracy while staying efficient enough for real deployment.
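The fusion step is described here as cross-attention. As a generic textbook illustration, scaled dot-product cross-attention lets each text-side query take a weighted average of audio-side values, with weights based on how well the query matches each key. The sketch below is not code from Mistral, the function names are ours, and the tiny two-dimensional vectors are invented for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into weights that sum to 1 (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: Sequence[Vector],  # text side, one vector per token
    keys: Sequence[Vector],     # audio side, one vector per frame
    values: Sequence[Vector],   # audio side, paired with keys
) -> List[List[float]]:
    """Scaled dot-product attention from text queries over audio keys and values."""
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# A query that matches the first audio frame far better than the second.
audio_keys = [[10.0, 0.0], [0.0, 10.0]]
audio_values = [[1.0, 0.0], [0.0, 1.0]]
print(cross_attention([[1.0, 0.0]], audio_keys, audio_values))
```

The output for that query sits almost entirely on the first audio frame's value, which is the "alignment" behavior the component list refers to.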
Why Whisper Notes Still Makes Sense
Voxtral is impressive, but Whisper Notes is a better fit for personal use:
What Whisper Notes Offers
Privacy
- 100% offline processing
- No data transmission
- No cloud dependencies
Performance
- Whisper technology, proven accuracy
- Optimized for Apple Silicon
- Reliable results
Cost
- $4.99 once
- No per-minute charges
- Unlimited transcription
User Experience
- Simple interface
- Regular updates
- Continuous improvements
Storage Requirements
Voxtral isn't practical for most personal users. Even Voxtral Mini needs over 9 GB of storage and far more VRAM than most consumer Macs can handle efficiently.
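A rough rule of thumb shows why model size matters so much on a laptop or phone: the weights alone take about parameters times bits per weight, divided by eight, in bytes. The Python sketch below applies that arithmetic. The 4-billion parameter count is a made-up round number for illustration, not Voxtral Mini's actual size, and real memory use is higher once activations and caches are included.

```python
def weights_size_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


HYPOTHETICAL_PARAMS = 4_000_000_000  # illustrative only, not a real model's size

for bits in (16, 8, 4):
    size = weights_size_gb(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits}-bit weights: {size:.1f} GB")
```

For this hypothetical model the weights shrink from 8 GB at 16-bit precision to 2 GB at 4-bit, which is why quantization is usually the first step toward on-device use.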
Whisper Notes uses Whisper Large-v3 Turbo, which balances performance, speed, and VRAM requirements for everyday use. We'll upgrade to better models when they become available with reasonable resource requirements.
Voxtral is a strong fit for developers and cloud apps. Whisper Notes is the better choice for individual users who want privacy, reliability, and zero subscriptions.
What This Means
Voxtral is a big step forward for speech recognition. Open-source models like this will push the industry forward quickly.
For now, Whisper Notes is still the better choice for private, offline transcription on Mac and iPhone.
Try Whisper Notes
Offline transcription for iPhone and Mac. $4.99 once, no subscription.