Guide · Updated June 2026

Whisper model sizes: tiny vs base vs small vs medium vs large

A practical guide to choosing OpenAI Whisper checkpoints for local transcription, podcasts, meetings, subtitles, multilingual speech recognition, and faster-whisper deployment workflows.

Quick recommendation

Most builders should start with Whisper small or medium, then compare against large-v3 or large-v3-turbo only when accuracy gains justify the extra compute. Use tiny or base for quick tests. Use faster-whisper when runtime efficiency, batching, quantization, or deployment packaging matters.

Whisper model size comparison

Model	Approximate size	Best fit	Practical note
Whisper tiny	Approx. 39M parameters	Fast tests, quick drafts, constrained hardware	Lowest quality tier; useful for rough notes or experiments.
Whisper base	Approx. 74M parameters	Lightweight local transcription and demos	A small step up from tiny while staying easy to run.
Whisper small	Approx. 244M parameters	Practical local transcription on many desktops	Often a good first serious local test before moving to medium or large.
Whisper medium	Approx. 769M parameters	Better accuracy when runtime is acceptable	Good middle ground for multilingual or noisier audio when hardware allows.
Whisper large / large-v2 / large-v3	Approx. 1.55B parameters	Higher-accuracy transcription and multilingual work	Heavier model family; test latency and memory before production use.
Whisper large-v3-turbo	Approx. 809M parameters (large-v3 with a much smaller decoder)	Faster high-quality transcription when supported	Verify current model card and runtime support; speed depends on backend and hardware.
faster-whisper runtimes	Runtime path, not a separate OpenAI model size	Optimized local and server transcription	Uses CTranslate2 and quantization options; benchmark with your real audio.

Exact memory use depends on implementation, compute type, quantization, audio length, language, batch size, timestamp settings, decoding options, and GPU/CPU backend. Verify the current model card and runtime docs.
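As a rough starting point, the parameter counts in the table can be turned into a lower bound on weight memory. The sketch below is plain arithmetic (parameters times bytes per weight). It assumes every weight is stored at the stated precision and ignores activations, decoding caches, and runtime overhead, so real usage will be higher. The function name `weight_memory_mib` is illustrative only.

```python
# Approximate parameter counts, taken from the comparison table above.
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large-v3": 1.55e9,
}

# Bytes per stored weight for common compute types (assumes uniform precision).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mib(model: str, compute_type: str) -> float:
    """Lower-bound estimate in MiB: weights only, no activations or overhead."""
    return PARAMS[model] * BYTES_PER_PARAM[compute_type] / (1024 ** 2)


if __name__ == "__main__":
    for name in PARAMS:
        row = ", ".join(
            f"{ct}: {weight_memory_mib(name, ct):,.0f} MiB" for ct in BYTES_PER_PARAM
        )
        print(f"{name:>9}  {row}")
```

Treat the output as a floor for planning, not a measurement. Check actual memory use on the target machine with the chosen runtime.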

How to choose by workflow

Choose tiny or base

Use tiny or base for quick experiments, low-resource devices, rough meeting notes, or early app prototypes where speed matters more than final transcript quality.

Choose small or medium

Use small or medium when you want a practical local transcription setup for podcasts, interviews, meetings, and internal media without jumping straight to the largest checkpoint.

Choose large-v3 or large-v3-turbo

Use large-family checkpoints when accuracy matters and hardware/runtime support is available. Large-v3-turbo can be attractive when a supported runtime provides better speed for the workload.

Choose faster-whisper

Use faster-whisper when deployment efficiency matters. It is a runtime/implementation choice that can make Whisper-family transcription more practical through CTranslate2 and quantization options.
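A minimal sketch of what this looks like in practice, assuming the `faster-whisper` Python package is installed. `WhisperModel` and its `transcribe` method come from that package. The helper names `pick_compute_type` and `transcribe_file` are illustrative, and the compute-type defaults (float16 on GPU, int8 on CPU) are a common starting point rather than a recommendation for every setup.

```python
def pick_compute_type(device: str) -> str:
    """Assumed defaults: half precision on CUDA GPUs, int8 quantization on CPU."""
    return "float16" if device == "cuda" else "int8"


def transcribe_file(path: str, model_size: str = "small", device: str = "cpu"):
    """Transcribe one file and return (detected language, list of segments)."""
    # Imported lazily so pick_compute_type works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        model_size, device=device, compute_type=pick_compute_type(device)
    )
    segments, info = model.transcribe(path, beam_size=5)
    # `segments` is a generator: the audio is decoded as it is consumed here.
    return info.language, [(s.start, s.end, s.text) for s in segments]


if __name__ == "__main__":
    # "audio.mp3" is a placeholder path; the model downloads on first use.
    language, segments = transcribe_file("audio.mp3")
    print("Detected language:", language)
    for start, end, text in segments:
        print(f"[{start:.2f}s -> {end:.2f}s] {text}")
```

Switching `model_size` or `device` is enough to compare checkpoints on the same file, which keeps the model choice and the runtime choice separate.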

Hardware and runtime notes

CPU-only transcription can work for smaller models, but long files may be slow.
NVIDIA GPUs, Apple silicon, and optimized runtimes can substantially change speed and practicality.
Quantized faster-whisper deployments can reduce memory needs, but should be tested for quality and timestamp behavior.
Batching helps throughput for server workloads, while single-file latency matters more for desktop transcription.
For production use, measure real audio: accents, noise, overlapping speech, domain vocabulary, and file duration matter more than generic benchmarks.
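One simple way to measure speed on real audio is the real-time factor: processing time divided by audio duration. The sketch below is runtime-agnostic and assumes you pass in your own `transcribe` callable and the known duration of the file. Both the function name and its signature are illustrative.

```python
import time
from typing import Callable


def real_time_factor(
    transcribe: Callable[[str], object], path: str, audio_seconds: float
) -> float:
    """Return processing time / audio duration. Below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe(path)  # must finish all decoding before returning
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

If the runtime returns a lazy generator of segments, as faster-whisper does, the callable should consume it fully (for example with `list(...)`), otherwise the timing only covers setup. Run each model more than once, since the first run often includes model loading.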

Accuracy and review caveats

Whisper is strong for many languages and noisy real-world audio, but no speech model is perfect. Important transcripts still need review, especially in medical, legal, compliance, finance, or customer-facing workflows. Watch for inserted words, wrong names, punctuation mistakes, missed speakers, timestamp drift, and language confusion.
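To put a number on transcript quality, a common metric is word error rate (WER): word-level edit distance between a hand-corrected reference and the model output, divided by the reference length. The sketch below is a minimal version that only lowercases and splits on whitespace. Real evaluations usually also normalize punctuation and numbers, which this example does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word inserted in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

WER does not capture everything listed above: a wrong name and a wrong filler word count the same, and timestamp drift is invisible to it, so it complements manual review rather than replacing it.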

FAQ

Which Whisper model size should I start with?

Start with small if your machine can run it comfortably. Use base for very light hardware and move to medium or large-v3 only after you know your audio quality, language mix, and latency needs.

Is Whisper large-v3 always the best choice?

No. It can be more accurate, but it is heavier. For many podcasts, meetings, and internal workflows, small, medium, or large-v3-turbo may be a better speed/quality balance depending on runtime and hardware.

Does faster-whisper change model accuracy?

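Not by design. faster-whisper is a reimplementation of Whisper inference on CTranslate2 that runs converted versions of the same checkpoints, so the underlying model is unchanged. Results can still differ somewhat from the reference implementation: accuracy and speed depend on the selected checkpoint, compute type, quantization, hardware, batching, audio quality, and decoding settings.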

Can Whisper hallucinate text?

Yes. Like other speech models, Whisper can produce wrong or inserted text, especially with noisy audio, long silences, overlapping speech, music, strong accents, or languages it handles poorly. Important transcripts still need review.
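Per-segment statistics can help point a reviewer at the riskiest parts of a transcript. The sketch below assumes segments shaped like the dictionaries returned by the reference `openai-whisper` package, with `compression_ratio`, `avg_logprob`, and `no_speech_prob` keys. The thresholds mirror that package's default decoding thresholds (2.4, -1.0, 0.6). The function name is illustrative, and a flag is a hint for review, not proof of a hallucination.

```python
def flag_for_review(segment: dict) -> list[str]:
    """Return heuristic reasons a segment deserves a second look (may be empty)."""
    reasons = []
    # Highly compressible text often means the decoder is looping on a phrase.
    if segment["compression_ratio"] > 2.4:
        reasons.append("repetitive text")
    # A low average token log-probability means the model was unsure.
    if segment["avg_logprob"] < -1.0:
        reasons.append("low confidence")
    # A high no-speech probability suggests text was emitted over silence or music.
    if segment["no_speech_prob"] > 0.6:
        reasons.append("possible non-speech audio")
    return reasons
```

faster-whisper exposes the same three values as attributes on its segment objects, so the checks carry over with attribute access in place of dictionary keys.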

Sources

OpenAI Whisper GitHub
Robust Speech Recognition via Large-Scale Weak Supervision
Whisper large-v3 model card
Whisper large-v3-turbo model card
faster-whisper GitHub

Next step

Pick one short audio file, test two Whisper sizes, and log speed, accuracy, timestamp quality, and editing time. The best model is the one that improves the transcript enough to justify its runtime cost.
