OpenAI's Whisper Large-v3 Turbo prunes the decoder from 32 layers to 4, cutting the parameter count from 1.55B to 809M. The result: transcription that runs 2–5 times faster with nearly equivalent accuracy. Whisper Notes integrates this model on Macs with Apple Silicon.
V3 Turbo vs V3: What Changed
Turbo is not a new architecture. It is the Whisper Large-v3 model with its decoder pruned from 32 layers to 4 and then fine-tuned to recover accuracy. The encoder is unchanged.
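As a rough sanity check on these figures, the sketch below estimates how many weights the 28 removed decoder layers account for. It assumes a model width of 1280 (the width of Whisper's large models) and a standard Transformer decoder layer, and it ignores biases and layer norms, so it is an approximation rather than an exact accounting.

```python
# Back-of-envelope estimate of the parameters removed by pruning the decoder.
# Assumption: d_model = 1280 (Whisper large); biases and layer norms are ignored.
D_MODEL = 1280

def decoder_layer_params(d_model: int) -> int:
    """Approximate weight count of one Transformer decoder layer."""
    self_attn = 4 * d_model * d_model   # Q, K, V and output projections
    cross_attn = 4 * d_model * d_model  # the same four projections over encoder output
    mlp = 2 * d_model * (4 * d_model)   # two linear layers with a 4x hidden size
    return self_attn + cross_attn + mlp

removed_layers = 32 - 4
removed_params = removed_layers * decoder_layer_params(D_MODEL)
reported_gap = 1_550_000_000 - 809_000_000

print(f"per decoder layer:  {decoder_layer_params(D_MODEL) / 1e6:.1f}M")
print(f"removed by pruning: {removed_params / 1e6:.0f}M")
print(f"reported gap:       {reported_gap / 1e6:.0f}M")
```

Under these assumptions the estimate lands within about 1% of the reported gap, which is consistent with the decoder being the only part that changed.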
| | Large-v3 Turbo | Large-v3 |
|---|---|---|
| Parameters | 809M | 1,550M |
| Decoder layers | 4 | 32 |
| Languages | 99 | 99 |
| Translation task | Not supported | Supported |
| License | MIT | Apache 2.0 |
The translation task was explicitly excluded from Turbo's training data. The full Large-v3 model supports it, but Whisper Notes uses only Turbo; translation is handled separately through Apple Intelligence.
Speed Benchmarks: Whisper Notes on Apple Silicon
In Whisper Notes for Mac, Turbo runs through Core ML on the Neural Engine. Time to process 10 minutes of audio on each device:
| Device | Whisper V3 | V3 Turbo | Speedup |
|---|---|---|---|
| iPhone 15 Pro | 425 s | 82 s | 5.2× |
| iPad Pro M2 | 380 s | 71 s | 5.4× |
| MacBook Pro M2 | 316 s | 63 s | 5.0× |
This roughly 5× speedup is specific to Whisper Notes on Apple Silicon, where the smaller decoder benefits from Neural Engine optimizations. On GPUs with frameworks such as faster-whisper, the gap narrows to about 2.7× (see the community benchmarks below).
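The speedup column can be reproduced directly from the timings. The following minimal sketch copies the three rows from the table and also derives how many seconds of audio Turbo processes per second of wall-clock time; the variable and function names are illustrative.

```python
# Timings copied from the table: seconds needed to process 10 minutes of audio.
AUDIO_SECONDS = 10 * 60
timings = {
    "iPhone 15 Pro": (425, 82),
    "iPad Pro M2": (380, 71),
    "MacBook Pro M2": (316, 63),
}

def speedup(baseline_s: float, turbo_s: float) -> float:
    """How many times faster Turbo finishes than the baseline."""
    return baseline_s / turbo_s

for device, (v3_s, turbo_s) in timings.items():
    ratio = speedup(v3_s, turbo_s)
    realtime = AUDIO_SECONDS / turbo_s  # audio seconds processed per wall-clock second
    print(f"{device}: {ratio:.1f}x faster, Turbo runs at {realtime:.1f}x real time")
```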
Accuracy: WER Comparison
The Hugging Face Open ASR Leaderboard tests both models on the same English datasets. Turbo's word error rate (WER) is within half a percentage point of V3 on every benchmark:
| Dataset | V3 Turbo WER | V3 WER |
|---|---|---|
| LibriSpeech Clean | 2.10% | 2.01% |
| LibriSpeech Other | 4.24% | 3.91% |
| GigaSpeech | 10.14% | 10.02% |
| Earnings22 | 11.63% | 11.29% |
| AMI | 16.13% | 15.95% |
| Average WER | 7.83% | 7.44% |
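V3 is slightly more accurate on every dataset, but the difference is small: 0.39 percentage points on the reported average. (The average row is lower than the mean of the five rows shown, 8.85% for Turbo and 8.64% for V3, so it presumably covers more datasets than are listed here.) For most real-world transcripts, you will not notice the difference.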
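For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal implementation of that definition; benchmark pipelines typically also normalize casing and punctuation before scoring, which this example does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from the hypothesis
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In this toy example one word is substituted and one is dropped, giving two errors over six reference words.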
On the long-form YouTube-commons evaluation (one of the largest open-source ASR benchmarks), Turbo reaches 13.40% WER versus 13.20% for V3, while running at a real-time factor of 129.5× versus 55.3×. That is 2.3 times faster with nearly equivalent accuracy on real-world audio.
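To make the real-time factors concrete, the sketch below converts them into wall-clock time for one hour of audio. It assumes the factor means audio duration divided by processing time (higher is faster), which is the reading the "2.3 times faster" comparison implies.

```python
def wall_clock_seconds(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by a real-time factor of audio time / compute time."""
    return audio_seconds / rtfx

one_hour = 3600
turbo_s = wall_clock_seconds(one_hour, 129.5)
v3_s = wall_clock_seconds(one_hour, 55.3)
print(f"Turbo: {turbo_s:.1f} s, V3: {v3_s:.1f} s, ratio: {v3_s / turbo_s:.2f}x")
```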
Community Benchmarks: GPU and CPU
Independent benchmarks from the faster-whisper and whisper.cpp communities show consistent results across a range of hardware. Transcribing 13 minutes of audio with faster-whisper on a GPU:
| Model | Precision | Time | GPU memory | WER |
|---|---|---|---|---|
| Large-v3 Turbo | fp16 | 19.2 s | 2,537 MB | 1.92% |
| Large-v3 | fp16 | 52.0 s | 4,521 MB | 2.88% |
| Large-v3 Turbo | int8 | 19.6 s | 1,545 MB | 1.92% |
| Distil-Large-v3 | fp16 | 26.1 s | 2,409 MB | 2.39% |
Source: faster-whisper benchmark on an NVIDIA GPU, LibriSpeech clean validation split. Turbo int8 uses only about 1.5 GB of VRAM, enough to fit on a 2 GB GPU.
Batched inference on an RTX 3060 Laptop GPU (6 GB VRAM, int8 precision) extends the advantage further:
| Model | Sequential | Batched (10) | Batched WER |
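One way to act on these memory figures is to choose the precision from the VRAM available. The sketch below does that and then loads the model with faster-whisper. The `headroom_mb` margin and the helper names are assumptions for illustration, and the snippet assumes a faster-whisper release that recognizes the `large-v3-turbo` model name.

```python
from typing import Optional

# Peak GPU memory for Large-v3 Turbo from the benchmark table, in MB.
TURBO_MEMORY_MB = {"float16": 2537, "int8": 1545}

def pick_compute_type(vram_mb: int, headroom_mb: int = 256) -> Optional[str]:
    """Return the highest precision that fits, or None if neither does.

    headroom_mb is an assumed safety margin, not a measured value.
    """
    for compute_type in ("float16", "int8"):
        if TURBO_MEMORY_MB[compute_type] + headroom_mb <= vram_mb:
            return compute_type
    return None

def transcribe(path: str, vram_mb: int) -> str:
    """Transcribe one file on the GPU; requires the faster-whisper package."""
    # Imported lazily so pick_compute_type stays dependency-free.
    from faster_whisper import WhisperModel

    compute_type = pick_compute_type(vram_mb)
    if compute_type is None:
        raise RuntimeError("not enough VRAM for Large-v3 Turbo on this GPU")
    model = WhisperModel("large-v3-turbo", device="cuda", compute_type=compute_type)
    segments, _info = model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)

print(pick_compute_type(2048))  # precision choice for a 2 GB card
```

With the assumed 256 MB margin, a 2 GB card falls back to int8 while a 4 GB card can run fp16.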
|---|---|---|---|
| Large-v3 Turbo | 46.1 s | 18.7 s | 7.7% |
| Large-v3 | 230.8 s | 43.0 s | 7.9% |
| Large-v2 | 178.3 s | 43.2 s | 8.8% |
| Medium | 113.3 s | 26.3 s | 8.9% |
Source: NilaierMusic benchmark, Intel i7-12650H + RTX 3060 Laptop 6 GB, French audio, int8 precision.
With batched processing, Turbo achieves the best WER of all the models tested (7.7%) while also being the fastest. That makes it the clear sweet spot for production use.
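The "sweet spot" claim can be checked mechanically: a model is a sensible choice if no other model is both at least as fast and at least as accurate. The sketch below applies that Pareto test to the batched results and also prints how much each model gains from batching; the numbers are copied from the table and the function name is illustrative.

```python
# (sequential seconds, batched seconds, batched WER %) copied from the table.
results = {
    "Large-v3 Turbo": (46.1, 18.7, 7.7),
    "Large-v3": (230.8, 43.0, 7.9),
    "Large-v2": (178.3, 43.2, 8.8),
    "Medium": (113.3, 26.3, 8.9),
}

def pareto_front(rows):
    """Models that no other model matches or beats on both batched time and WER."""
    front = []
    for name, (_, time_s, wer) in rows.items():
        dominated = any(
            other_time <= time_s
            and other_wer <= wer
            and (other_time, other_wer) != (time_s, wer)
            for other, (_, other_time, other_wer) in rows.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

for name, (seq_s, batch_s, _) in results.items():
    print(f"{name}: batching gives {seq_s / batch_s:.1f}x")
print("Pareto-optimal:", pareto_front(results))
```

On this data Turbo is the only model left on the front, since it has both the lowest batched time and the lowest WER.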
Known Limitations (and How Whisper Notes Handles Them)
No built-in translation
Turbo was trained without translation data. It only transcribes in the source language, unlike Large-v3, which supports translating audio into English.
Whisper Notes: Apple Intelligence automatically translates the transcript into the language you choose, giving you bilingual output regardless of which model you use.
More hallucinations on noisy audio
Community reports suggest that Turbo hallucinates more than V3 on very short clips and noisy recordings. This is expected with a smaller decoder (4 layers versus 32).
Whisper Notes: runs Pyannote VAD before transcription, detecting the segments that contain speech and discarding silence and noise so that the model processes only actual speech.
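To illustrate the idea of gating audio before transcription, here is a deliberately crude energy-threshold detector. It is not Pyannote and not what Whisper Notes runs; it is a stand-in that only shows the shape of the step (find speech spans, drop the rest). The frame size and threshold are illustrative assumptions, and a neural VAD is far more robust to real noise.

```python
import math

def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_s, end_s) spans whose per-frame RMS energy exceeds threshold.

    A crude stand-in for a real VAD such as Pyannote; frame_ms and threshold
    are illustrative assumptions, not tuned values.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    regions = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms > threshold and start is None:
            start = i * frame_len / sample_rate  # speech begins at this frame
        elif rms <= threshold and start is not None:
            regions.append((start, i * frame_len / sample_rate))  # speech ended
            start = None
    if start is not None:
        regions.append((start, n_frames * frame_len / sample_rate))
    return regions

# Synthetic input: 0.3 s of silence, 0.6 s of a 220 Hz tone, 0.3 s of silence.
rate = 16000
silence = [0.0] * (rate * 300 // 1000)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate * 600 // 1000)]
regions = speech_regions(silence + tone + silence, rate)
print(regions)
```

Only the detected spans would then be passed to the transcription model, so it never sees the silent stretches where hallucinations tend to appear.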
Which Model Should You Use?
| Language | Recommended model |
|---|---|
| English / European | Parakeet V3: 10 times faster than Whisper, better accuracy |
| Chinese / Japanese / Korean | SenseVoice: purpose-built for CJK, 52× speed |
| Other languages | Whisper Large V3 Turbo: 99 languages, high accuracy, slower |
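If you route audio programmatically, the recommendations above reduce to a small lookup. The sketch below is hypothetical: the function name and the language-code sets are assumptions made for illustration, not an official list of what each model supports.

```python
# Illustrative routing based on the recommendations in the table.
# The language-code sets are assumptions for this sketch, not an official list.
EUROPEAN = {"en", "de", "fr", "es", "it", "pt", "nl", "pl"}
CJK = {"zh", "ja", "ko"}

def recommend_model(language_code: str) -> str:
    """Map an ISO 639-1 language code to the recommended model name."""
    code = language_code.lower()
    if code in EUROPEAN:
        return "Parakeet V3"
    if code in CJK:
        return "SenseVoice"
    return "Whisper Large V3 Turbo"  # fallback: broadest language coverage

for code in ("en", "ja", "vi"):
    print(code, "->", recommend_model(code))
```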