Whisper Large V3 Turbo vs V3: 5× Faster on Mac (Benchmark)

November 6, 2024 · 6 min read · Whisper Notes Team

OpenAI's Whisper Large-v3 Turbo cuts the decoder from 32 layers to 4, reducing the parameter count from 1.55B to 809M. The result: transcription that is 2 to 5 times faster at nearly the same accuracy. Whisper Notes ships this model on Macs with Apple Silicon.

Architecture comparison: Whisper Large V3 Turbo vs. V3

V3 Turbo vs V3: What Changed

Turbo is not a new architecture. It is the Whisper Large-v3 model with its decoder pruned from 32 layers to 4 and then fine-tuned to recover accuracy. The encoder is unchanged.

                    Large-v3 Turbo    Large-v3
Parameters          809M              1,550M
Decoder layers      4                 32
Languages           99                99
Translation task    Not supported     Supported
License             MIT               Apache 2.0
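
As a rough sanity check on the parameter counts in the table, the sketch below estimates how many weights the 28 removed decoder layers would account for. It assumes a hidden width of 1280 for Whisper large and counts only the attention and feed-forward weight matrices (biases and layer norms are ignored). The function name is illustrative and not taken from any Whisper codebase.

```python
D_MODEL = 1280  # assumed hidden width of Whisper large (not stated in this article)


def decoder_layer_params(d_model: int) -> int:
    """Approximate weight count of one Transformer decoder layer."""
    attention = 4 * d_model * d_model      # query, key, value and output projections
    feed_forward = 8 * d_model * d_model   # two linear layers with a 4x expansion
    # One self-attention block, one cross-attention block, one feed-forward block.
    return 2 * attention + feed_forward


removed_layers = 32 - 4
estimated_removed = removed_layers * decoder_layer_params(D_MODEL)

# Difference between the two parameter counts listed in the table.
reported_removed = 1_550_000_000 - 809_000_000

print(f"estimated: {estimated_removed / 1e6:.0f}M, reported: {reported_removed / 1e6:.0f}M")
```

Under these assumptions the estimate lands within about 1% of the reported difference, which is consistent with the claim that only decoder layers were removed.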

The translation task was explicitly excluded from Turbo's training data. The full Large-v3 model supports it, but Whisper Notes uses only Turbo; translation is handled separately through Apple Intelligence.

Speed Benchmark: Whisper Notes on Apple Silicon

In Whisper Notes, Turbo runs through Core ML on the Neural Engine. Time to transcribe 10 minutes of audio on three Apple Silicon devices:

Device            Whisper V3    V3 Turbo    Speedup
iPhone 15 Pro     425 s         82 s        5.2×
iPad Pro M2       380 s         71 s        5.4×
MacBook Pro M2    316 s         63 s        5.0×

This roughly 5× speedup is specific to Whisper Notes on Apple Silicon, where the smaller decoder benefits from Neural Engine optimization. On GPUs with frameworks such as faster-whisper, the gap narrows to about 2.7× (see the community benchmarks below).
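
The speedup column can be reproduced from the raw timings. The sketch below recomputes it and also expresses each Turbo run as a real-time factor (seconds of audio processed per second of compute). The numbers are copied from the table; the helper names are illustrative.

```python
AUDIO_SECONDS = 10 * 60  # the benchmark clip is 10 minutes long

# (Whisper V3 seconds, V3 Turbo seconds), copied from the table above
timings = {
    "iPhone 15 Pro": (425, 82),
    "iPad Pro M2": (380, 71),
    "MacBook Pro M2": (316, 63),
}


def speedup(v3_seconds: float, turbo_seconds: float) -> float:
    """How many times faster Turbo finished than V3."""
    return v3_seconds / turbo_seconds


def realtime_factor(processing_seconds: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Seconds of audio transcribed per second of processing."""
    return audio_seconds / processing_seconds


speedups = {device: round(speedup(v3, turbo), 1) for device, (v3, turbo) in timings.items()}
turbo_rtf = {device: round(realtime_factor(turbo), 1) for device, (_, turbo) in timings.items()}

for device in timings:
    print(f"{device}: {speedups[device]}x faster, Turbo real-time factor {turbo_rtf[device]}x")
```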

Accuracy: WER Comparison

The Hugging Face Open ASR Leaderboard tests both models on the same English datasets. Turbo's word error rate (WER) is within half a percentage point of V3's on every benchmark:

Dataset              V3 Turbo WER    V3 WER
LibriSpeech Clean    2.10%           2.01%
LibriSpeech Other    4.24%           3.91%
GigaSpeech           10.14%          10.02%
Earnings22           11.63%          11.29%
AMI                  16.13%          15.95%
Mean WER             7.83%           7.44%
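
A quick way to check the "within half a point" claim is to compute the per-dataset gaps directly. The sketch below does that with the five dataset rows from the table, and also averages those five rows for comparison with the table's mean row.

```python
# (V3 Turbo WER %, V3 WER %), copied from the five dataset rows above
wer_table = {
    "LibriSpeech Clean": (2.10, 2.01),
    "LibriSpeech Other": (4.24, 3.91),
    "GigaSpeech": (10.14, 10.02),
    "Earnings22": (11.63, 11.29),
    "AMI": (16.13, 15.95),
}

# Positive gap means Turbo made more errors than V3 on that dataset.
gaps = {name: round(turbo - v3, 2) for name, (turbo, v3) in wer_table.items()}
largest_gap = max(gaps.values())

# Mean over only the five rows listed here.
mean_turbo = round(sum(turbo for turbo, _ in wer_table.values()) / len(wer_table), 2)
mean_v3 = round(sum(v3 for _, v3 in wer_table.values()) / len(wer_table), 2)

print(gaps)
print(f"largest gap: {largest_gap} points; five-row means: {mean_turbo}% vs {mean_v3}%")
```

The largest gap is 0.34 points (Earnings22). The five-row means come out higher than the 7.83% and 7.44% in the table's last row, which suggests that row averages over more datasets than the five shown.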

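V3 is slightly more accurate on every dataset, but the gap is small: 0.39 percentage points between the two mean WER figures. (The mean row is not the average of the five rows shown, so it presumably covers datasets beyond those listed.) For most real-world transcripts, you will not notice the difference.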
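
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Below is a minimal sketch of that definition; it splits on whitespace only and skips the text normalization (casing, punctuation) that leaderboards typically apply before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

At a 0.39-point difference, the two models would differ by roughly four words per thousand.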

On the long-form YouTube-commons evaluation (one of the largest open-source ASR benchmarks), Turbo reaches 13.40% WER versus 13.20% for V3, while running at a real-time factor of 129.5× versus 55.3×. That is 2.3× faster at nearly the same accuracy on real-world audio.
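
To make the real-time factors concrete, the sketch below converts them into wall-clock time for a one-hour recording, assuming the factor means "seconds of audio processed per second of compute". The one-hour length is just an illustrative choice.

```python
TURBO_RTFX = 129.5  # from the YouTube-commons result above
V3_RTFX = 55.3

ONE_HOUR = 3600  # seconds of audio, chosen for illustration


def seconds_to_process(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time implied by a real-time factor."""
    return audio_seconds / rtfx


turbo_seconds = seconds_to_process(ONE_HOUR, TURBO_RTFX)
v3_seconds = seconds_to_process(ONE_HOUR, V3_RTFX)
relative_speed = TURBO_RTFX / V3_RTFX

print(f"Turbo: {turbo_seconds:.1f} s, V3: {v3_seconds:.1f} s, ratio: {relative_speed:.1f}x")
```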

Community Benchmarks: GPU and CPU

Independent benchmarks from the faster-whisper and whisper.cpp communities show consistent results across different hardware. Transcribing 13 minutes of audio with faster-whisper on a GPU:

Model              Precision    Time      GPU memory    WER
Large-v3 Turbo     fp16         19.2 s    2,537 MB      1.92%
Large-v3           fp16         52.0 s    4,521 MB      2.88%
Large-v3 Turbo     int8         19.6 s    1,545 MB      1.92%
Distil-Large-v3    fp16         26.1 s    2,409 MB      2.39%

Source: faster-whisper benchmark on an NVIDIA GPU, LibriSpeech clean validation split. Turbo int8 uses only about 1.5 GB of VRAM, enough to fit on a 2 GB GPU.
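
As an illustration of how these numbers might guide a setup, the sketch below picks a precision for Turbo from the available VRAM and then loads the model with faster-whisper. The memory figures come from the table above; the 256 MB headroom, the helper names, and the assumption that the installed faster-whisper version recognizes the "large-v3-turbo" model name are mine, not the benchmark's.

```python
# GPU memory in MB for Large-v3 Turbo, from the faster-whisper table above
TURBO_MEMORY_MB = {"float16": 2537, "int8": 1545}


def pick_compute_type(vram_mb: int, headroom_mb: int = 256):
    """Return the highest-precision option that fits, or None if neither does.

    headroom_mb is an assumed safety margin, not a measured value.
    """
    for compute_type in ("float16", "int8"):
        if TURBO_MEMORY_MB[compute_type] + headroom_mb <= vram_mb:
            return compute_type
    return None


def transcribe(audio_path: str, vram_mb: int) -> str:
    """Transcribe a file with Turbo at the precision that fits the GPU."""
    # Imported lazily so pick_compute_type works without faster-whisper installed.
    from faster_whisper import WhisperModel

    compute_type = pick_compute_type(vram_mb)
    if compute_type is None:
        raise RuntimeError("not enough VRAM for Large-v3 Turbo on this GPU")

    model = WhisperModel("large-v3-turbo", device="cuda", compute_type=compute_type)
    segments, _info = model.transcribe(audio_path)  # segments is a lazy generator
    return " ".join(segment.text.strip() for segment in segments)


print(pick_compute_type(2048))  # a 2 GB card
print(pick_compute_type(6144))  # a 6 GB card
```

Since fp16 and int8 finished within half a second of each other at the same WER in this benchmark, int8 looks like a reasonable default whenever memory is tight.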

Batched inference on an RTX 3060 Laptop (6 GB VRAM, int8 precision) shortens every model's run, and Turbo remains the fastest:

Model             Sequential    Batched (10)    Batched WER
Large-v3 Turbo    46.1 s        18.7 s          7.7%
Large-v3          230.8 s       43.0 s          7.9%
Large-v2          178.3 s       43.2 s          8.8%
Medium            113.3 s       26.3 s          8.9%

Source: NilaierMusic benchmark, Intel i7-12650H + RTX 3060 Laptop 6 GB, French audio, int8 precision.

With batching, Turbo posts the lowest WER of all the models tested (7.7%) while also being the fastest. On this benchmark it is the sweet spot for production use.
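
The table supports two different comparisons: how much each model gains from batching, and how far ahead of Large-v3 Turbo is in each mode. The sketch below computes both from the timings as listed; the variable names are illustrative.

```python
# (sequential seconds, batched seconds), copied from the table above
times = {
    "Large-v3 Turbo": (46.1, 18.7),
    "Large-v3": (230.8, 43.0),
    "Large-v2": (178.3, 43.2),
    "Medium": (113.3, 26.3),
}

# How much faster each model gets when switching from sequential to batched.
batching_gain = {model: round(seq / bat, 1) for model, (seq, bat) in times.items()}


def turbo_lead(column: int) -> float:
    """Large-v3 time divided by Turbo time for one column (0 = sequential, 1 = batched)."""
    return round(times["Large-v3"][column] / times["Large-v3 Turbo"][column], 1)


lead_sequential = turbo_lead(0)
lead_batched = turbo_lead(1)
fastest_batched = min(times, key=lambda model: times[model][1])

print(batching_gain)
print(f"Turbo vs Large-v3: {lead_sequential}x sequential, {lead_batched}x batched")
print(f"fastest batched model: {fastest_batched}")
```

On these figures the larger models gain more from batching than Turbo does, so Turbo's lead over Large-v3 shrinks from about 5.0× to about 2.3×, but it still has the shortest absolute time.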

Known Limitations (and How Whisper Notes Handles Them)

No built-in translation

Turbo was trained without translation data. It only transcribes in the source language, unlike Large-v3, which can translate audio into English.

Whisper Notes: Apple Intelligence automatically translates the transcript into the language you choose, giving you bilingual output whichever model you use.

More hallucinations on noisy audio

Community reports suggest that Turbo hallucinates more than V3 on very short clips or noisy recordings. That is expected with a smaller decoder (4 layers versus 32).

Whisper Notes: runs Pyannote VAD (voice activity detection) before transcription, detecting the segments that contain speech and discarding silence and noise so the model only processes actual speech.
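
To show the general idea of gating audio before transcription, here is a deliberately simple energy-threshold detector. It is not Pyannote and not what Whisper Notes runs; Pyannote uses a trained neural model, whereas this sketch only flags frames whose RMS level exceeds an assumed threshold. The frame length and threshold values are illustrative.

```python
import math


def speech_segments(samples, sample_rate=16000, frame_ms=10, threshold=0.02):
    """Return (start, end) times in seconds for runs of frames above the RMS threshold.

    samples is a sequence of floats in [-1.0, 1.0]; threshold is an assumed value.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        t = i * frame_len / sample_rate
        if rms >= threshold and start is None:
            start = t                     # a loud run begins
        elif rms < threshold and start is not None:
            segments.append((start, t))   # the run ended at this frame
            start = None
    if start is not None:
        # Audio ended while still inside a loud run.
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


# 0.1 s of silence, 0.1 s of constant-level signal, 0.1 s of silence
audio = [0.0] * 1600 + [0.5] * 1600 + [0.0] * 1600
print(speech_segments(audio))
```

Only the returned spans would then be passed to the transcription model, which removes the silent stretches where hallucinated text tends to appear.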

Which Model Should You Use?

English / European             Parakeet V3: 10× faster than Whisper, better accuracy
Chinese / Japanese / Korean    SenseVoice: built specifically for CJK, 52× speed
Other languages                Whisper Large V3 Turbo: 99 languages, high accuracy, slower
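
The recommendation above amounts to a small routing rule keyed on language. A minimal sketch follows; the function name is hypothetical, and the European language set is an illustrative subset chosen for the example, not the actual list of languages Parakeet V3 supports.

```python
# Illustrative subset only; the real coverage of Parakeet V3 is not given in this article.
EUROPEAN_LANGUAGES = {"en", "fr", "de", "es", "it", "pt", "nl", "pl"}
CJK_LANGUAGES = {"zh", "ja", "ko"}


def recommend_model(language_code: str) -> str:
    """Map an ISO 639-1 language code to the model recommended above."""
    code = language_code.lower()
    if code in EUROPEAN_LANGUAGES:
        return "Parakeet V3"
    if code in CJK_LANGUAGES:
        return "SenseVoice"
    # Fallback with the widest language coverage (99 languages).
    return "Whisper Large V3 Turbo"


for code in ("en", "ja", "vi"):
    print(code, "->", recommend_model(code))
```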