Guide · Updated June 2026

Whisper model sizes: tiny vs base vs small vs medium vs large.

A practical guide to choosing OpenAI Whisper checkpoints for local transcription, podcasts, meetings, subtitles, multilingual speech recognition, and faster-whisper deployment workflows.

Quick recommendation

Most builders should start with Whisper small or medium, then compare against large-v3 or large-v3-turbo only when accuracy gains justify the extra compute. Use tiny or base for quick tests. Use faster-whisper when runtime efficiency, batching, quantization, or deployment packaging matters.

Whisper model size comparison

| Model | Approximate size | Best fit | Practical note |
| --- | --- | --- | --- |
| Whisper tiny | Approx. 39M parameters | Fast tests, quick drafts, constrained hardware | Lowest quality tier; useful for rough notes or experiments. |
| Whisper base | Approx. 74M parameters | Lightweight local transcription and demos | A small step up from tiny while staying easy to run. |
| Whisper small | Approx. 244M parameters | Practical local transcription on many desktops | Often a good first serious local test before moving to medium or large. |
| Whisper medium | Approx. 769M parameters | Better accuracy when runtime is acceptable | Good middle ground for multilingual or noisier audio when hardware allows. |
| Whisper large / large-v2 / large-v3 | Approx. 1.55B parameters | Higher-accuracy transcription and multilingual work | Heavier model family; test latency and memory before production use. |
| Whisper large-v3-turbo | Approx. 809M parameters (large-v3 with a reduced decoder) | Faster high-quality transcription when supported | Verify current model card and runtime support; speed depends on backend and hardware. |
| faster-whisper runtimes | Runtime path, not a separate OpenAI model size | Optimized local and server transcription | Uses CTranslate2 and quantization options; benchmark with your real audio. |

Exact memory use depends on implementation, compute type, quantization, audio length, language, batch size, timestamp settings, decoding options, and GPU/CPU backend. Verify the current model card and runtime docs.
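
As a rough illustration of why compute type matters, the sketch below multiplies the approximate parameter counts from the table by the bytes each weight occupies. It is a weights-only lower bound: real memory use is higher because of activations, decoding caches, and runtime overhead, and quantized runtimes do not necessarily store every tensor at the reduced width. The function name `weight_memory_gb` is a hypothetical helper, not part of any Whisper library.

```python
# Approximate parameter counts taken from the comparison table.
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large": 1.55e9,
}

# Bytes used to store one weight at each numeric precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(model: str, compute_type: str = "float16") -> float:
    """Weights-only estimate in GB (1e9 bytes); excludes activations and caches."""
    return PARAMS[model] * BYTES_PER_PARAM[compute_type] / 1e9


for name in PARAMS:
    fp16 = weight_memory_gb(name, "float16")
    int8 = weight_memory_gb(name, "int8")
    print(f"{name:>6}: ~{fp16:.2f} GB fp16 weights, ~{int8:.2f} GB int8 weights")
```

Treat these figures as a floor for planning, then measure actual peak memory with your chosen runtime and audio.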

How to choose by workflow

Choose tiny or base

Use tiny or base for quick experiments, low-resource devices, rough meeting notes, or early app prototypes where speed matters more than final transcript quality.

Choose small or medium

Use small or medium when you want a practical local transcription setup for podcasts, interviews, meetings, and internal media without jumping straight to the largest checkpoint.

Choose large-v3 or large-v3-turbo

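Use large-family checkpoints when accuracy matters and hardware/runtime support is available. large-v3-turbo is derived from large-v3 with a much smaller decoder, so it can be attractive when you want quality close to the large family at lower latency and your runtime supports the checkpoint.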

Choose faster-whisper

Use faster-whisper when deployment efficiency matters. It is a runtime/implementation choice that can make Whisper-family transcription more practical through CTranslate2 and quantization options.
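
A minimal sketch of a faster-whisper transcription call, assuming the `faster-whisper` package is installed and that `small` is the checkpoint under test. `pick_compute_type` and `transcribe` are hypothetical helper names, and the `int8` on CPU / `float16` on CUDA pairing is a common starting point rather than a rule.

```python
def pick_compute_type(device: str) -> str:
    """Hypothetical helper: a common starting compute type per device."""
    return "float16" if device == "cuda" else "int8"


def transcribe(path: str, model_size: str = "small", device: str = "cpu"):
    """Transcribe one file and return (start, end, text) tuples."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        model_size,
        device=device,
        compute_type=pick_compute_type(device),
    )
    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding only runs as the generator is consumed.
    segments, info = model.transcribe(path, beam_size=5)
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```

Calling `transcribe("audio.mp3")` with a real file path (the name here is a placeholder) downloads the converted checkpoint on first use, so the first run is not representative of steady-state speed.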

Hardware and runtime notes

  • CPU-only transcription can work for smaller models, but long files may be slow.
  • NVIDIA GPUs, Apple silicon, and optimized runtimes can substantially change speed and practicality.
  • Quantized faster-whisper deployments can reduce memory needs, but should be tested for quality and timestamp behavior.
  • Batching helps throughput for server workloads, while single-file latency matters more for desktop transcription.
  • For production use, measure real audio: accents, noise, overlapping speech, domain vocabulary, and file duration matter more than generic benchmarks.
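
One way to make the speed points above measurable is the real-time factor (RTF): processing time divided by audio duration, where values below 1.0 mean faster than real time. The sketch below is backend-agnostic; `time_call` and `real_time_factor` are hypothetical helper names, and the callable you time would wrap whichever transcription function you use.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Run fn once and return wall-clock seconds elapsed."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Example arithmetic: a 10-minute file (600 s) processed in 90 s.
print(f"RTF = {real_time_factor(90, 600):.2f}")
```

For server workloads, the same ratio computed over a whole batch gives throughput; for desktop use, time a single file end to end, including model load if that happens per run.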

Accuracy and review caveats

Whisper is strong for many languages and noisy real-world audio, but no speech model is perfect. Important transcripts still need review, especially in medical, legal, compliance, finance, or customer-facing workflows. Watch for inserted words, wrong names, punctuation mistakes, missed speakers, timestamp drift, and language confusion.
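
To put a number on transcript quality during review, a common metric is word error rate (WER): the word-level edit distance between a hand-corrected reference and the model output, divided by the reference length. The sketch below is a minimal pure-Python version that only lowercases and splits on whitespace; it does not normalize punctuation or numbers, which a production evaluation usually would.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```

WER does not capture wrong speaker labels or timestamp drift, so it complements manual review rather than replacing it.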

FAQ

Which Whisper model size should I start with?

Start with small if your machine can run it comfortably. Use base for very light hardware and move to medium or large-v3 only after you know your audio quality, language mix, and latency needs.

Is Whisper large-v3 always the best choice?

No. It can be more accurate, but it is heavier. For many podcasts, meetings, and internal workflows, small, medium, or large-v3-turbo may be a better speed/quality balance depending on runtime and hardware.

Does faster-whisper change model accuracy?

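Not by design. faster-whisper is an optimized implementation that runs the same Whisper checkpoints through CTranslate2, so it is not a separate model. In practice, accuracy and speed still depend on the selected checkpoint, compute type, quantization, hardware, batching, audio quality, and decoding settings, so compare outputs on your own audio before switching.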

Can Whisper hallucinate text?

Yes. Like other speech models, Whisper can produce wrong or inserted text, especially with noisy audio, long silences, overlapping speech, music, strong accents, or languages the model handles poorly. Important transcripts still need review.
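
A minimal sketch of flagging segments for human review, assuming your runtime exposes per-segment `avg_logprob`, `no_speech_prob`, and `compression_ratio` values (both openai-whisper and faster-whisper report fields with these names). The `Segment` class and `review_reasons` function are hypothetical stand-ins, and the default thresholds mirror openai-whisper's default decoding thresholds; treat them as starting points to tune, not as guarantees.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Stand-in for a runtime's segment object (illustrative only)."""
    text: str
    avg_logprob: float        # mean token log-probability
    no_speech_prob: float     # probability the window contains no speech
    compression_ratio: float  # high values suggest repetitive output


def review_reasons(
    seg: Segment,
    min_logprob: float = -1.0,
    max_no_speech: float = 0.6,
    max_compression: float = 2.4,
) -> List[str]:
    """Return heuristic reasons a segment deserves a human look."""
    reasons = []
    if seg.avg_logprob < min_logprob:
        reasons.append("low confidence")
    if seg.no_speech_prob > max_no_speech:
        reasons.append("likely silence")
    if seg.compression_ratio > max_compression:
        reasons.append("repetitive text")
    return reasons


# Made-up values for illustration, not real model output.
suspect = Segment("Thanks for watching!", -1.4, 0.8, 1.1)
print(review_reasons(suspect))  # ['low confidence', 'likely silence']
```

An empty list does not prove a segment is correct; these signals only help prioritize where a reviewer looks first.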

Next step

Pick one short audio file, test two Whisper sizes, and log speed, accuracy, timestamp quality, and editing time. The best model is the one that improves the transcript enough to justify its runtime cost.
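
The comparison above can be logged with a small script like the sketch below. The `Trial` fields, the `pick_model` rule (fastest model whose error rate is within a tolerance of the best), and the numbers are all illustrative assumptions; replace the placeholder values with your own measurements.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Trial:
    model: str
    real_time_factor: float  # processing time / audio duration
    word_error_rate: float   # against a hand-corrected reference
    editing_minutes: float   # time spent fixing the transcript


def to_csv(trials: List[Trial]) -> str:
    """Serialize trials as CSV text for a spreadsheet or log file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(asdict(trials[0]).keys()))
    writer.writeheader()
    for trial in trials:
        writer.writerow(asdict(trial))
    return buf.getvalue()


def pick_model(trials: List[Trial], max_wer_gap: float = 0.02) -> Trial:
    """Fastest trial whose error rate is within max_wer_gap of the best."""
    best_wer = min(t.word_error_rate for t in trials)
    close_enough = [t for t in trials if t.word_error_rate - best_wer <= max_wer_gap]
    return min(close_enough, key=lambda t: t.real_time_factor)


# Placeholder numbers for illustration only; these are not measurements.
trials = [
    Trial("small", 0.10, 0.090, 12.0),
    Trial("medium", 0.25, 0.065, 8.0),
    Trial("large-v3", 0.60, 0.060, 7.5),
]
print(to_csv(trials))
print("Pick:", pick_model(trials).model)
```

Widening `max_wer_gap` favors speed and narrowing it favors accuracy, which makes the trade-off explicit instead of implicit.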