Guide · Updated June 2026
Whisper model sizes: tiny vs base vs small vs medium vs large.
A practical guide to choosing OpenAI Whisper checkpoints for local transcription, podcasts, meetings, subtitles, multilingual speech recognition, and faster-whisper deployment workflows.
Quick recommendation
Most builders should start with Whisper small or medium, then compare against large-v3 or large-v3-turbo only when accuracy gains justify the extra compute. Use tiny or base for quick tests. Use faster-whisper when runtime efficiency, batching, quantization, or deployment packaging matters.
Whisper model size comparison
| Model | Approximate size | Best fit | Practical note |
|---|---|---|---|
| Whisper tiny | Approx. 39M parameters | Fast tests, quick drafts, constrained hardware | Lowest quality tier; useful for rough notes or experiments. |
| Whisper base | Approx. 74M parameters | Lightweight local transcription and demos | A small step up from tiny while staying easy to run. |
| Whisper small | Approx. 244M parameters | Practical local transcription on many desktops | Often a good first serious local test before moving to medium or large. |
| Whisper medium | Approx. 769M parameters | Better accuracy when runtime is acceptable | Good middle ground for multilingual or noisier audio when hardware allows. |
| Whisper large / large-v2 / large-v3 | Approx. 1.55B parameters | Higher-accuracy transcription and multilingual work | Heavier model family; test latency and memory before production use. |
| Whisper large-v3-turbo | Approx. 809M parameters (large-v3 with the decoder reduced from 32 to 4 layers) | Faster high-quality transcription when supported | Verify the current model card and runtime support; speed depends on backend and hardware. |
| faster-whisper runtimes | Runtime path, not a separate OpenAI model size | Optimized local and server transcription | Uses CTranslate2 and quantization options; benchmark with your real audio. |
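
The parameter counts in the table translate into a rough lower bound on memory: weights alone take about parameters times bytes per weight. The sketch below works through that arithmetic. The helper name `weight_megabytes` and the precision labels are illustrative, and the figures exclude activations, decoding buffers, and runtime overhead, so real usage will be higher.

```python
# Approximate parameter counts taken from the comparison table.
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large-v3": 1.55e9,
}

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(model: str, precision: str) -> float:
    """Rough size of the weights alone, in MB (10**6 bytes)."""
    return PARAMS[model] * BYTES_PER_PARAM[precision] / 1e6


if __name__ == "__main__":
    for name in PARAMS:
        sizes = ", ".join(
            f"{p}: {weight_megabytes(name, p):,.0f} MB" for p in BYTES_PER_PARAM
        )
        print(f"{name:9s} {sizes}")
```

Treat the result as a floor for planning, not a measurement: a checkpoint whose weights barely fit in memory is unlikely to run comfortably.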
How to choose by workflow
Choose tiny or base
Use tiny or base for quick experiments, low-resource devices, rough meeting notes, or early app prototypes where speed matters more than final transcript quality.
Choose small or medium
Use small or medium when you want a practical local transcription setup for podcasts, interviews, meetings, and internal media without jumping straight to the largest checkpoint.
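For a subtitle-style workflow, a minimal sketch with the reference `openai-whisper` package might look like the following. It assumes the package and ffmpeg are installed; `transcribe_to_srt` and the two formatting helpers are hypothetical names, and the segment shape (dicts with `start`, `end`, `text`) follows what `model.transcribe` returns under `result["segments"]`.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{span}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "small") -> str:
    """Transcribe one file and return SRT text (downloads the model on first use)."""
    import whisper  # lazy import so the helpers above work without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

Switching `model_name` between `"small"` and `"medium"` on the same file is a quick way to see whether the larger checkpoint changes the subtitles enough to matter.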
Choose large-v3 or large-v3-turbo
Use large-family checkpoints when accuracy matters and hardware/runtime support is available. Large-v3-turbo can be attractive when a supported runtime provides better speed for the workload.
Choose faster-whisper
Use faster-whisper when deployment efficiency matters. It is a runtime/implementation choice that can make Whisper-family transcription more practical through CTranslate2 and quantization options.
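As a rough illustration, loading a checkpoint through faster-whisper with an explicit compute type could be sketched as below. It assumes the `faster-whisper` package is installed; `pick_compute_type` and its defaults are an illustrative choice, not a recommendation from the library, and the right setting depends on your hardware.

```python
def pick_compute_type(device: str) -> str:
    """Choose a CTranslate2 compute type for the device (illustrative defaults)."""
    # float16 is a common choice on CUDA GPUs; int8 keeps CPU memory use low.
    return "float16" if device == "cuda" else "int8"


def transcribe(audio_path: str, model_size: str = "small", device: str = "cpu"):
    """Return the detected language and a list of (start, end, text) rows."""
    from faster_whisper import WhisperModel  # lazy import

    model = WhisperModel(
        model_size, device=device, compute_type=pick_compute_type(device)
    )
    segments, info = model.transcribe(audio_path, beam_size=5)
    # `segments` is a generator: decoding happens as it is consumed.
    rows = [(seg.start, seg.end, seg.text) for seg in segments]
    return info.language, rows
```

Because the segments come back lazily, any timing you record should wrap the loop that consumes them, not just the `transcribe` call.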
Hardware and runtime notes
- CPU-only transcription can work for smaller models, but long files may be slow.
- NVIDIA GPUs, Apple silicon, and optimized runtimes can substantially change speed and practicality.
- Quantized faster-whisper deployments can reduce memory needs, but should be tested for quality and timestamp behavior.
- Batching helps throughput for server workloads, while single-file latency matters more for desktop transcription.
- For production use, measure real audio: accents, noise, overlapping speech, domain vocabulary, and file duration matter more than generic benchmarks.
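One simple way to compare these hardware and runtime options is the real-time factor: processing time divided by audio duration, where values below 1 mean faster than real time. The sketch below is backend-agnostic; `real_time_factor` is a hypothetical helper, and the caller supplies both the transcription function and the audio length in seconds.

```python
import time
from typing import Callable


def real_time_factor(
    transcribe: Callable[[str], object], audio_path: str, audio_seconds: float
) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe(audio_path)  # must fully finish decoding before returning
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

Run it a few times per configuration and skip the first run, which often includes model loading or cache warm-up.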
Accuracy and review caveats
Whisper is strong for many languages and noisy real-world audio, but no speech model is perfect. Important transcripts still need review, especially in medical, legal, compliance, finance, or customer-facing workflows. Watch for inserted words, wrong names, punctuation mistakes, missed speakers, timestamp drift, and language confusion.
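To focus human review, one option is to flag segments whose decoding statistics look suspicious. The sketch below assumes segments are dicts carrying `no_speech_prob`, `avg_logprob`, and `compression_ratio`, as in `openai-whisper` output; the thresholds mirror that package's decoding defaults and are only a starting point, and `review_reasons` and `flag_for_review` are hypothetical helpers.

```python
# Thresholds mirror the openai-whisper decoding defaults; tune them on your audio.
NO_SPEECH_PROB_MAX = 0.6
AVG_LOGPROB_MIN = -1.0
COMPRESSION_RATIO_MAX = 2.4


def review_reasons(segment: dict) -> list:
    """Return human-readable reasons a segment deserves a closer look."""
    reasons = []
    if segment.get("no_speech_prob", 0.0) > NO_SPEECH_PROB_MAX:
        reasons.append("likely silence or non-speech")
    if segment.get("avg_logprob", 0.0) < AVG_LOGPROB_MIN:
        reasons.append("low decoder confidence")
    if segment.get("compression_ratio", 0.0) > COMPRESSION_RATIO_MAX:
        reasons.append("repetitive text")
    return reasons


def flag_for_review(segments) -> list:
    """Pair each suspicious segment with the reasons it was flagged."""
    flagged = []
    for seg in segments:
        reasons = review_reasons(seg)
        if reasons:
            flagged.append((seg, reasons))
    return flagged
```

These signals catch only some failure modes. Wrong names and plausible-sounding substitutions can pass every threshold, so flagged segments are a review priority list, not a replacement for review.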
FAQ
Which Whisper model size should I start with?
Start with small if your machine can run it comfortably. Use base for very light hardware and move to medium or large-v3 only after you know your audio quality, language mix, and latency needs.
Is Whisper large-v3 always the best choice?
No. It can be more accurate, but it is heavier. For many podcasts, meetings, and internal workflows, small, medium, or large-v3-turbo may be a better speed/quality balance depending on runtime and hardware.
Does faster-whisper change model accuracy?
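Not inherently. faster-whisper is an optimized implementation that runs the same Whisper checkpoints through CTranslate2, so output at a comparable precision should stay close to the reference implementation. Quantized compute types and different decoding settings can shift results, so accuracy and speed still depend on the selected checkpoint, compute type, hardware, batching, audio quality, and decoding settings.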
Can Whisper hallucinate text?
Yes. Like other speech models, Whisper can produce wrong or inserted text, especially with noisy audio, long silences, overlapping speech, music, strong accents, or languages it handles poorly. Important transcripts still need review.
Next step
Pick one short audio file, test two Whisper sizes, and log speed, accuracy, timestamp quality, and editing time. The best model is the one that improves the transcript enough to justify its runtime cost.
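To put a number on accuracy for that test, a common metric is word error rate (WER): word-level edit distance against a hand-corrected reference, divided by the reference length. The sketch below is a minimal pure-Python version; the tokenizer is a simplifying assumption (lowercase, punctuation dropped) and may not suit every language.

```python
import re


def words(text: str) -> list:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] holds the distance between the reference prefix so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)
```

Log WER next to processing time for each model size; a lower WER that does not reduce editing time may not justify the slower checkpoint.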