Mistral Voxtral vs Whisper vs GPT-4o: Speech AI Benchmark

Q: What is Mistral Voxtral?

Voxtral is Mistral's first family of native speech recognition models, released in two open-source sizes: Voxtral Small (12B parameters, higher accuracy) and Voxtral Mini (smaller and faster). In WER benchmarks, Voxtral Small beats GPT-4o Audio while costing about 92% less.

Q: Can Voxtral run locally on a Mac?

In principle, yes — the full model weights are open source and free to download. In practice, it isn't practical for most personal users: even Voxtral Mini needs over 9GB of storage and more VRAM than most consumer Macs can handle efficiently.

Q: Does Whisper Notes use Voxtral?

No. Whisper Notes runs Parakeet V3 (the Mac default), Whisper Large-v3 Turbo, and SenseVoice — models chosen because they balance performance, speed, and VRAM requirements for everyday use on Apple Silicon. We'll adopt better models whenever they become practical to run on consumer hardware.

Q: How many languages does Voxtral support?

13 languages with auto-detection: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. It also handles code-switching within the same audio without manual configuration.

Q: When was Mistral Voxtral released?

Mistral released Voxtral in July 2025 — its first family of native speech recognition models, launched in two open-source sizes: Voxtral Small (12B parameters) and Voxtral Mini. This analysis was published shortly after the release, in August 2025.

Mistral released Voxtral—their first native speech recognition models. They're open-source and competitive with GPT-4o Audio at a fraction of the cost. Here's what the research shows and why Whisper Notes still uses Whisper Large-v3 Turbo for offline transcription.

Two Models

Mistral released two versions:

Voxtral Small

•12B parameters
•Higher accuracy, handles noise really well
•Slower, more resource-intensive
•Perfect for complex audio

Voxtral Mini

•Way smaller, way faster
•Real-time processing
•Lower hardware requirements
•Works on edge devices

Open Source

Voxtral is open-source. Unlike GPT-4o Audio, you can download and run it yourself:

✓ Full model weights available
✓ Deploy anywhere, modify as needed
✓ No API costs or vendor lock-in
✓ Process audio on your own servers

Benchmarks

WER (Word Error Rate) comparison shows Voxtral Small beats GPT-4o Audio—lower is better:

Voxtral WER Benchmark Comparison across all models

WER comparison across speech recognition models

Model	WER (English)	Multilingual WER	Processing Speed
Voxtral Small	2.1%	3.8%	Fast
Voxtral Mini	3.2%	4.9%	Very Fast
GPT-4o Audio	2.8%	4.1%	Slow
Whisper Large v3	2.4%	3.9%	Medium

Voxtral vs Whisper: Which Should You Use?

Benchmarks only tell part of the story. Here's how the two model families compare on the factors that matter if you actually want to run one yourself:

	Voxtral Small	Voxtral Mini	Whisper Large v3	Whisper Large-v3 Turbo
Languages	13	13	99+	99+
Open source	Yes	Yes	Yes	Yes
Model size	12B parameters	9GB+ on disk	1.55B parameters	809M parameters (~1.5GB)
Practical on a consumer Mac	No	No	Yes	Yes

The verdict: Voxtral wins some WER benchmarks, and through Mistral's API it's excellent value for cloud workloads. For running speech recognition on your own machine, though, Whisper Large-v3 Turbo remains the practical choice — small enough to fit comfortably on a consumer Mac, with far broader language coverage. That's exactly why Whisper Notes ships Whisper Large-v3 Turbo alongside Parakeet V3 and SenseVoice rather than Voxtral.

Pricing

Voxtral costs 92% less than GPT-4o Audio:

Voxtral Small

$0.20

per million tokens

GPT-4o Audio

$2.50

per million tokens

Cost Savings

92%

vs GPT-4o Audio

How It Works

Mistral's research paper explains the key innovations:

1. Multimodal Architecture

Voxtral processes speech and text together instead of handling them separately:

•Understands speech and context simultaneously
•Handles audio up to 2 hours long
•Adapts to accents and background noise in real time

Streaming Encoder

Processes audio in 30ms chunks with 200ms latency—fast enough for real-time meetings and interviews.

2. Training Dataset

Large multilingual dataset with real-world conditions:

•2.3 million hours of speech across 13 languages
•Trained on noisy audio, reverb, compression artifacts
•Continuous learning without forgetting previous training

3. Efficiency Optimizations

Technical improvements for way faster inference:

•Flash Attention v3—70% less memory, faster processing
•Adjusts compute based on audio complexity
•4-bit quantization with minimal accuracy loss (< 0.1% WER increase)

4. Key Features

Contextual Understanding

Maintains context across entire conversations—perfect for meetings, interviews, and long recordings.

Multilingual

Supports 13 languages with auto-detection (English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, Dutch). Handles code-switching within the same audio without manual configuration.

Noise Handling

Automatically adapts to reverb, echo, and background noise.

Edge Deployment

Runs on edge devices with just 4GB RAM—enables on-device transcription.

5. Architecture

Three main components:

1. Audio Encoder: Conformer-based encoder converts audio to acoustic representations
2. Multimodal Fusion: Cross-attention aligns audio with text understanding
3. Language Decoder: Based on Mistral's LLM, fine-tuned for speech

This setup gives good accuracy while staying efficient enough for real deployment.

Why Whisper Notes Still Makes Sense

Voxtral is impressive, but Whisper Notes is a way better fit for personal use:

What Whisper Notes Offers

Privacy

•100% offline processing
•No data transmission
•No cloud dependencies

Performance

•Whisper technology, proven accuracy
•Optimized for Apple Silicon
•Reliable results

Cost

•$6.99 once
•No per-minute charges
•Unlimited transcription

User Experience

•Simple interface
•Regular updates
•Continuous improvements

Storage Requirements

Voxtral isn't practical for most personal users. Even Voxtral Mini needs over 9GB of storage and way more VRAM than most consumer Macs can handle efficiently.

Whisper Notes uses Whisper Large-v3 Turbo—it balances performance, speed, and VRAM requirements for everyday use. We'll upgrade to better models when they become available with reasonable resource requirements.

Voxtral is perfect for developers and cloud apps. Whisper Notes is way better for individual users who want privacy, reliability, and zero subscriptions.

What This Means

Voxtral is a big step forward for speech recognition. Open-source models like this will push the industry forward fast.

For now, Whisper Notes is still a way better choice for private, offline transcription on Mac and iPhone.

Comparing all the local options? Every on-device speech-to-text model — Whisper variants, Parakeet V3, SenseVoice, and Voxtral — is compared side by side on our Whisper models comparison page.

Frequently Asked Questions

What is Mistral Voxtral?

Voxtral is Mistral's first family of native speech recognition models, released in two open-source sizes: Voxtral Small (12B parameters, higher accuracy) and Voxtral Mini (smaller and faster). In WER benchmarks, Voxtral Small beats GPT-4o Audio while costing about 92% less.

Can Voxtral run locally on a Mac?

In principle, yes — the full model weights are open source and free to download. In practice, it isn't practical for most personal users: even Voxtral Mini needs over 9GB of storage and more VRAM than most consumer Macs can handle efficiently.

Does Whisper Notes use Voxtral?

No. Whisper Notes runs Parakeet V3 (the Mac default), Whisper Large-v3 Turbo, and SenseVoice — models chosen because they balance performance, speed, and VRAM requirements for everyday use on Apple Silicon. We'll adopt better models whenever they become practical to run on consumer hardware.

How many languages does Voxtral support?

13 languages with auto-detection: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. It also handles code-switching within the same audio without manual configuration.

When was Mistral Voxtral released?

Mistral released Voxtral in July 2025 — its first family of native speech recognition models, launched in two open-source sizes: Voxtral Small (12B parameters) and Voxtral Mini. This analysis was published shortly after the release, in August 2025.

Download for iOS

Download for macOS

Two Models

Voxtral Small

Voxtral Mini

Open Source

Benchmarks

Voxtral vs Whisper: Which Should You Use?

Pricing

Voxtral Small

GPT-4o Audio

Cost Savings

How It Works

1. Multimodal Architecture

Streaming Encoder

2. Training Dataset

3. Efficiency Optimizations

4. Key Features

Contextual Understanding

Multilingual

Noise Handling

Edge Deployment

5. Architecture

Why Whisper Notes Still Makes Sense

What Whisper Notes Offers

Privacy

Performance

Cost

User Experience

Storage Requirements

What This Means

Frequently Asked Questions

Related