What Is Quantization?
Quantization compresses a model's internal numbers from 16-bit floating point down to 8 or 4 bits. The result: the same model uses roughly half the GPU memory at 8-bit and about a third at 4-bit, with only a small drop in output quality. It is the single most important technique for running capable AI on consumer hardware.
Why model weights take so much memory
A language model is, at its core, a very large collection of numbers called weights. These weights encode everything the model learned during training. A 7-billion-parameter model has 7,000,000,000 individual weights.
In FP16 (half precision, the usual format for unquantized weights), each weight is stored as a 16-bit floating point number, which takes 2 bytes. That makes a 7B model roughly 14 GB before any runtime overhead. Most consumer GPUs have 8–16 GB of VRAM, so an unquantized 7B model will not fit on most of them.
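Quantization solves this by reducing the number of bits used to store each weight. At exactly 4 bits, each weight takes 0.5 bytes instead of 2, a 4× size reduction that would put a 7B model at about 3.5 GB. In practice a 7B model at 4-bit quantization needs roughly 4.5–5 GB of VRAM, because formats such as Q4_K_M also store per-block scale factors and keep some tensors at higher precision, and the runtime needs room for the context cache.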
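The arithmetic above can be checked with a short sketch. The function name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and the result covers raw weight storage only, so real files and real VRAM use come out somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes).

    Ignores quantization metadata and runtime overhead, so treat the
    result as a lower bound rather than a VRAM requirement.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

For 7 billion weights this gives 14.0 GB at 16 bits, 7.0 GB at 8 bits, and 3.5 GB at 4 bits.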
The quantization tradeoff: size vs quality
Reducing bit depth means some information is lost. Weights that were precise decimal values get rounded to the nearest representable value in the lower-precision format. The question is how much that rounding hurts outputs.
In practice, well-implemented quantization at 4-bit introduces only a small quality drop, because trained models tolerate small rounding errors in most weights while a relatively small fraction of them matters disproportionately. Modern quantization methods (like the K-quants used in GGUF files) take advantage of this: they group weights into small blocks, store a scale factor for each block, and keep the more sensitive tensors at higher precision, so the rounding error lands where it matters least.
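To make the rounding concrete, here is a minimal sketch of block-wise quantization in plain Python. It is a naive symmetric round-to-nearest scheme with one scale per block, not the actual K-quant layout used in GGUF files. The block size of 32 and all function names are assumptions chosen for illustration.

```python
import random


def quantize_block(block, bits=4):
    """Quantize one block of floats to signed integer codes plus one scale.

    Codes lie in [-qmax, qmax], where qmax = 2**(bits - 1) - 1 (7 for 4-bit).
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in block)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Map integer codes back to approximate float weights."""
    return [scale * c for c in codes]


def roundtrip(weights, block_size=32, bits=4):
    """Quantize and immediately dequantize, block by block."""
    restored = []
    for start in range(0, len(weights), block_size):
        scale, codes = quantize_block(weights[start:start + block_size], bits)
        restored.extend(dequantize_block(scale, codes))
    return restored


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(256)]
    for bits in (8, 4, 2):
        restored = roundtrip(weights, bits=bits)
        worst = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit: worst-case rounding error {worst:.6f}")
```

Within each block the error per weight is at most half the block's scale, so fewer bits means a coarser grid and larger errors. Giving each block its own scale keeps one large weight from coarsening the grid for the whole tensor.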
| Format | Bits per weight | 7B model VRAM | Quality vs FP16 |
|---|---|---|---|
| fp16 | 16 | ~14 GB | Reference (no loss) |
| q8_0 | 8 | ~7.5 GB | Nearly identical |
| q4_k_m | 4 (K-quant) | ~4.5 GB | Small drop, generally acceptable |
| q3_k_m | 3 | ~3.5 GB | Noticeable degradation |
| q2_k | 2 | ~2.8 GB | Significant quality loss |
How to choose a quantization level
The practical decision comes down to your VRAM and the task you are running.
- 8 GB VRAM (RTX 3070, 4060): Use Q4_K_M for 7–8B models. This is the best fit: it gives you a capable model without overflowing into system RAM.
- 10–12 GB VRAM (RTX 3080, 4070): Use Q8_0 for 7B models, or Q4_K_M for 13B models. More room means better quality at the same parameter count.
- 16 GB VRAM (RTX 4080): Run 13B at Q8_0. A 30B model at Q4_K_M needs roughly 19 GB by the same scaling as the table above, so expect partial offload to system RAM.
- 24 GB VRAM (RTX 4090, 3090): 34B at Q4_K_M fits. 70B models require CPU offload or a dual-GPU setup.
- Apple Silicon (unified memory): M3 Max with 64 GB can run 70B at Q4_K_M fully in memory. The unified pool is the significant advantage here.
When in doubt, use Q4_K_M. It is the community default for a reason: the size savings are large, the quality tradeoff is small for most tasks, and it is what most model pages on Hugging Face and Ollama default to.
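The rule of thumb above can be sketched as a small helper. It assumes VRAM use scales linearly with parameter count from the 7B figures in the table, and it reserves 1 GB of headroom for context and runtime overhead. Both are simplifying assumptions, and the names `estimate_vram_gb` and `pick_quant` are hypothetical.

```python
# Approximate VRAM for a 7B model, copied from the table in this article.
# Ordered from highest quality to lowest.
VRAM_7B_GB = {
    "fp16": 14.0,
    "q8_0": 7.5,
    "q4_k_m": 4.5,
    "q3_k_m": 3.5,
    "q2_k": 2.8,
}


def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """Scale the 7B figure linearly to another model size (an assumption)."""
    return VRAM_7B_GB[quant] * params_billion / 7.0


def pick_quant(params_billion: float, vram_gb: float, headroom_gb: float = 1.0):
    """Return the highest-quality level whose estimate fits, or None.

    None means the model needs CPU offload or more memory at every level.
    """
    for quant in VRAM_7B_GB:  # dicts keep insertion order
        if estimate_vram_gb(params_billion, quant) + headroom_gb <= vram_gb:
            return quant
    return None


if __name__ == "__main__":
    for size, vram in [(7, 8), (7, 12), (13, 12), (13, 16), (34, 24), (70, 24)]:
        print(f"{size}B on {vram} GB -> {pick_quant(size, vram)}")
```

The helper only checks whether a level fits. It says nothing about whether a very low level such as Q2_K is worth running, so a smaller model at Q4_K_M may still be the better choice when only the lowest levels fit.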
Quantization in Ollama: what you actually do
If you use Ollama, you almost never need to think about quantization directly. When you run:
`ollama pull llama3.1:8b`
Ollama automatically downloads a pre-quantized GGUF file, usually Q4_K_M. The quantization decision has already been made for you by the model publisher.
To select a specific quantization level, use a tag that names it:
`ollama pull llama3.1:8b-instruct-q8_0`
Available quantization tags for each model are listed on ollama.com/library. If you are unsure, pull the default and see how it performs. You can always pull a different quantization level later and delete the one you do not need.
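If you script your model pulls, it can help to read the quantization level back out of a tag. The sketch below assumes tags end with a suffix such as `q8_0`, `q4_K_M`, or `fp16`. That is a naming convention seen in library listings, not something Ollama guarantees, and `split_tag` is a hypothetical helper, not part of any Ollama API.

```python
import re

# Trailing quantization suffix, e.g. q8_0, q4_K_M, fp16 (assumed convention).
_QUANT_SUFFIX = re.compile(r"(?:^|-)(q\d+(?:_[a-z0-9]+)*|fp16)$", re.IGNORECASE)


def split_tag(reference: str):
    """Split 'name:tag' into (name, tag, quant).

    quant is lowercased, or None when the tag does not name a level,
    which usually means the publisher's default quantization applies.
    """
    name, _, tag = reference.partition(":")
    match = _QUANT_SUFFIX.search(tag)
    quant = match.group(1).lower() if match else None
    return name, tag, quant


if __name__ == "__main__":
    for ref in ("llama3.1:8b", "llama3.1:8b-instruct-q8_0"):
        print(ref, "->", split_tag(ref))
```

A `None` result only means the tag itself is silent about the level. The model's page lists what the default tag actually contains.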
Frequently asked questions
What does Q4_K_M actually mean?
Q4 means 4-bit quantization: weights are stored in roughly 4 bits instead of 16. K_M names the method: K for the K-quant family developed in the llama.cpp project, M for its medium variant. K-quants are smarter than a flat 4-bit conversion: they group weights into blocks, store compact scale factors per block, and keep some sensitive tensors at higher precision, which preserves quality better at a similar size. Q4_K_M is the most common format you will encounter and the default recommendation for consumer hardware.
How much VRAM does quantization save?
At FP16 (no quantization), a 7B model needs roughly 14 GB of VRAM. At Q4_K_M, that drops to about 4.5–5 GB. At Q8_0, it lands around 7.5 GB. The savings scale with model size: a 13B FP16 model needs ~26 GB, but at Q4_K_M fits in 8–9 GB. This is why quantization unlocks consumer hardware like RTX 3070 (8 GB) for genuinely capable models.
Does quantization hurt quality?
It depends on the quantization level and the task. Q8_0 is nearly indistinguishable from FP16 on most benchmarks. Q4_K_M shows a small but measurable drop on complex reasoning, instruction-following precision, and math, but is acceptable for chat, summarisation, and retrieval-augmented generation (RAG). Q3 models degrade noticeably and Q2 models significantly; neither is generally recommended for practical use.
Should I use Q4 or Q8?
Use Q4_K_M if your VRAM is the constraint. Use Q8_0 if you have the headroom — the quality difference is meaningful for technical or reasoning tasks. For a 7B model, Q8 requires roughly 7–8 GB VRAM versus 4–5 GB for Q4. If your GPU has 10 GB+, Q8_0 of a 7B model is a straightforward choice. If you are on 8 GB VRAM, Q4_K_M of a 7–8B model is the standard recommendation.
What is the difference between GGUF and GPTQ quantization?
GGUF (used by llama.cpp and Ollama) runs on both CPU and GPU and supports flexible mixed-precision K-quants. GPTQ is calibration-based quantization targeted at NVIDIA GPUs, typically used with ExLlamaV2 or vLLM. For most local AI beginners using Ollama or LM Studio, GGUF Q4_K_M is the right starting format. GPTQ becomes relevant if you run on a dedicated NVIDIA GPU with vLLM or need maximum throughput.
Does Ollama handle quantization automatically?
Yes. When you run `ollama pull llama3.1:8b`, Ollama downloads a pre-quantized GGUF file (usually Q4_K_M). You do not need to quantize anything yourself. You can select a specific quantization level by using a tag that names it: `ollama pull llama3.1:8b-instruct-q8_0`. The model library page on ollama.com lists available tags for each model.