Guide · Hardware · Reviewed June 2026

What Is VRAM and How Much Do You Need for Local AI?

VRAM is the single most important hardware constraint for running local LLMs. It determines which model sizes you can run at full GPU speed, how much context you can process, and how much quantization you need to apply. This guide explains the mechanics and gives you the tables you need to plan before downloading anything.

Editorial review

Reviewed by: OpenSourcesAI Editorial
Last updated: June 2026
Sources: GGUF quantization documentation (llama.cpp), Hugging Face model cards, Ollama model library size data, GPU manufacturer specs, and OpenSourcesAI editorial review.

AI tools, model releases, pricing, licenses, and platform terms can change quickly. Verify the official source before production or commercial use.

What VRAM is, in plain English

VRAM (Video RAM) is the dedicated memory built into your GPU. It is separate from your computer's system RAM. When a local LLM runs on GPU, the model weights — the billions of numerical parameters that define what the model knows — must be loaded into VRAM before any text can be generated.

If the model fits entirely in VRAM, generation is fast: typically 15–50 tokens per second on consumer GPUs. If the model is too large to fit, the runtime offloads layers to system RAM and runs them on the CPU instead. This is called CPU offload. It works, but it can be 5–20 times slower than full GPU inference.

This is why VRAM is the number you need to know before choosing a model. It is not about total system RAM, not about internet speed, and not about CPU speed for most inference workloads. It is about how many gigabytes your GPU has.

How to check your VRAM

Run one of these commands to see how much VRAM your GPU has:

# Windows — PowerShell
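# AdapterRAM is a 32-bit field, so cards with more than 4 GB can be under-reported here.
Get-CimInstance Win32_VideoController | Select-Object Name, @{n='VRAM_GB'; e={$_.AdapterRAM / 1GB}}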

# Windows — NVIDIA only
nvidia-smi --query-gpu=memory.total --format=csv,noheader

# Linux
nvidia-smi

# macOS (Apple Silicon — shows total unified memory, not separate VRAM)
system_profiler SPHardwareDataType | grep "Memory"

On Windows, you can also open Task Manager → Performance → GPU and read the Dedicated GPU Memory value. On NVIDIA systems, nvidia-smi is the authoritative source.
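If you want the same number inside a script, the NVIDIA query above can be wrapped in a few lines of Python. This is a minimal sketch, assuming nvidia-smi is on your PATH; the function names are illustrative, and the nounits format option is added so that each GPU reports a bare integer in MiB.

```python
import subprocess


def parse_total_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_total_vram_mib() -> list[int]:
    """Run nvidia-smi and return total VRAM per GPU in MiB (NVIDIA systems only)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_total_vram_mib(result.stdout)
```

Calling query_total_vram_mib() returns one entry per GPU. Divide by 1024 to convert MiB to GiB.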

The VRAM formula for model weights

The minimum VRAM to load model weights follows a straightforward formula:

VRAM for weights (GB) ≈ (parameters_in_billions × bits_per_parameter) / 8

Examples:
  7B model at FP16 (16 bits):  (7 × 16) / 8  = 14 GB
  7B model at Q8  ( 8 bits):   (7 ×  8) / 8  =  7 GB
  7B model at Q4  ( 4 bits):   (7 ×  4) / 8  =  3.5 GB (+ ~0.6 GB overhead ≈ 4.1 GB)
 14B model at Q4  ( 4 bits):  (14 ×  4) / 8  =  7 GB   (+ overhead ≈ 8.4 GB)
 70B model at Q4  ( 4 bits):  (70 ×  4) / 8  = 35 GB   (+ overhead ≈ 38–42 GB)

Add 10–20% for runtime overhead, activations, and a short-context KV cache. At long context the KV cache needs its own line in the budget. In practice, always leave at least 1–2 GB of VRAM headroom beyond the raw weight estimate. A model that exactly fills VRAM will likely crash or throttle under load.
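The weight formula is easy to script if you want to compare several model sizes at once. The sketch below is illustrative: the function name is made up for this example, and it uses nominal bit counts, so real GGUF files will come out somewhat larger.

```python
def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Raw weight size in GB: (parameters in billions x bits per parameter) / 8."""
    return params_billions * bits_per_param / 8


# Reproduce the examples above and show the 10-20% overhead band.
for label, params, bits in [
    ("7B FP16", 7, 16),
    ("7B Q8", 7, 8),
    ("7B Q4", 7, 4),
    ("14B Q4", 14, 4),
    ("70B Q4", 70, 4),
]:
    raw = weight_vram_gb(params, bits)
    print(f"{label}: {raw:.1f} GB raw, {raw * 1.1:.1f} to {raw * 1.2:.1f} GB with overhead")
```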

KV cache: the hidden VRAM cost

Beyond the model weights, the KV cache consumes additional VRAM. The KV cache stores intermediate attention computation results so the model does not have to reprocess past tokens on every generation step.

KV cache size (bytes) =
  2 × n_layers × n_kv_heads × head_dim × context_length × bytes_per_element

Where:
  2                  = one key tensor + one value tensor
  n_layers           = number of transformer layers
  n_kv_heads         = number of KV heads (reduced in GQA models, e.g. 8 for Llama 3 8B)
  head_dim           = per-head hidden dimension (typically 128)
  context_length     = number of tokens held in the cache
  bytes_per_element  = 2 for FP16, 1 for INT8

Example — Llama 3 8B at 4K context (FP16 KV):
  2 × 32 layers × 8 KV heads × 128 × 4096 tokens × 2 bytes
  = ~0.54 GB of KV cache

Example — same model at 32K context:
  2 × 32 × 8 × 128 × 32768 × 2
  = ~4.3 GB of KV cache

At short context lengths (2K–4K), KV cache overhead is modest. At long context (16K–128K), it becomes a major budget item. If you plan to use large context windows, factor KV cache into your VRAM plan before choosing a model or quantization level.
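To try the KV cache formula against your own context length, a small helper like the one below is enough. It is a sketch rather than a measurement: the function name is illustrative, the shape values are the Llama 3 8B figures used in the example above, and results are in decimal GB (1 GB = 10^9 bytes).

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_length: int,
    bytes_per_element: int = 2,  # 2 for FP16, 1 for INT8
) -> int:
    """KV cache size in bytes; the leading 2 covers one key and one value tensor."""
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_element


# Llama 3 8B shape from the example above: 32 layers, 8 KV heads, head_dim 128.
for context in (4096, 8192, 32768):
    size_gb = kv_cache_bytes(32, 8, 128, context) / 1e9
    print(f"{context} tokens: {size_gb:.2f} GB")
```

Because the formula is linear in context_length, doubling the context doubles the cache size.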

Total VRAM budget estimate

Total VRAM ≈ model_weights + kv_cache + runtime_overhead

Conservative planning formula:
  Total ≈ (params_B × bits_per_param / 8) × 1.15  +  kv_cache_GB  +  1 GB

Example — 14B model, Q4_K_M, 8K context:
  Weights:  (14 × 4 / 8) × 1.15 = 7 GB × 1.15 ≈ 8.1 GB
  KV cache: ~1 GB (at 8K context, GQA model)
  Overhead: 1 GB
  Total:    ~10.1 GB → needs 12 GB VRAM
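The conservative planning formula can be written as a single function. This is a sketch with an illustrative name. The 1.15 weight margin and 1 GB runtime overhead are the defaults from the formula above, and the KV cache figure is whatever you estimate for your context window.

```python
def total_vram_gb(
    params_billions: float,
    bits_per_param: float,
    kv_cache_gb: float,
    weight_margin: float = 1.15,
    runtime_overhead_gb: float = 1.0,
) -> float:
    """Conservative total: padded weights + KV cache + fixed runtime overhead."""
    weights_gb = params_billions * bits_per_param / 8 * weight_margin
    return weights_gb + kv_cache_gb + runtime_overhead_gb


# The 14B, Q4_K_M, 8K-context example from above, with ~1 GB of KV cache.
print(f"{total_vram_gb(14, 4, kv_cache_gb=1.0):.2f} GB")
```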

Quantization formats and quality tradeoffs

Quantization reduces the numerical precision of model weights to lower memory use. GGUF format (used by Ollama, LM Studio, and llama.cpp) supports a range of quantization levels. For most beginners, Q4_K_M is the best starting point.

Format	Bits / param	Bytes / param	Quality note
FP32	32	4	Exact training precision. Rarely used for inference — 4× heavier than FP16.
BF16 / FP16	16	2	Near-lossless. Standard for fast GPU inference when VRAM allows.
Q8_0 (INT8)	8	1	Very close to FP16 quality. Good balance of size and accuracy.
Q6_K	6	~0.75	High quality, moderately compact. Useful when Q8 is too large.
Q5_K_M	5	~0.625	Good quality. Sits between Q4 and Q8 for users who want more headroom.
Q4_K_M	4	~0.5	Most common beginner choice. Acceptable quality, roughly half the memory of FP16.
Q4_K_S	4	~0.45	Slightly smaller than Q4_K_M with a marginal quality reduction.
Q3_K_M	3	~0.375	Noticeable quality drop on most models. Use only when VRAM is very constrained.
Q2_K	2	~0.25	Significant degradation. Emergency option for very low VRAM environments.

The K in GGUF names such as Q4_K_M and Q4_K_S marks the k-quant method, and the trailing S or M marks the small or medium variant. K-quants use a per-block mixed-precision approach that improves quality at the same bit count compared to older uniform quants. Prefer Q4_K_M over Q4_0 when both are available.
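One way to use the table is to walk it from highest to lowest precision and stop at the first format that fits your card. The sketch below does that under stated assumptions: it uses the nominal bits per parameter from the table, a 1.15 weight margin, and a 2 GB reserve for KV cache and runtime overhead. The function name and both defaults are illustrative, and real GGUF files are somewhat larger than the nominal figures.

```python
# Nominal bits per parameter from the table above, highest precision first.
NOMINAL_BITS = [
    ("FP16", 16),
    ("Q8_0", 8),
    ("Q6_K", 6),
    ("Q5_K_M", 5),
    ("Q4_K_M", 4),
    ("Q3_K_M", 3),
    ("Q2_K", 2),
]


def best_quant(params_billions, vram_gb, reserve_gb=2.0, weight_margin=1.15):
    """Return the highest-precision format whose padded weights fit, or None."""
    budget_gb = vram_gb - reserve_gb  # what is left for weights after the reserve
    for name, bits in NOMINAL_BITS:
        if params_billions * bits / 8 * weight_margin <= budget_gb:
            return name
    return None


for params, vram in [(7, 8), (7, 12), (14, 12), (70, 24)]:
    print(f"{params}B on {vram} GB: {best_quant(params, vram)}")
```

Treat the result as a starting point and confirm it against the actual file size on the model card.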

Model size vs VRAM reference table

Approximate VRAM needed per model size at common quantization levels, including a typical runtime overhead buffer. Actual values vary by model architecture, context window, and runtime. Always verify with the model card and your target runner.

Model size	Q4_K_M	Q8_0	FP16	Notes
1B	~0.7 GB	~1.2 GB	~2 GB	Runs on almost any hardware. Useful for edge and embedded.
3B	~2 GB	~3.5 GB	~6 GB	Fits in 4–6 GB VRAM. Practical for fast chat on low-end GPUs.
7B	~4.1 GB	~7.7 GB	~14 GB	The most common starting size. Q4 fits in 6–8 GB VRAM comfortably.
8B	~4.9 GB	~8.7 GB	~16 GB	Llama 3 8B and Llama 3.1 8B. Slightly larger than classic 7B.
13B	~7.9 GB	~14 GB	~26 GB	Q4 fits in 12 GB VRAM. FP16 exceeds a single 24 GB card such as the RTX 3090.
14B	~8.4 GB	~15 GB	~28 GB	Qwen2.5 14B, Phi-4. Q4 fits in 12 GB VRAM.
30B / 32B	~18–20 GB	~34 GB	~60 GB	Q4 requires 24 GB VRAM or dual-GPU setup.
70B	~38–42 GB	~74 GB	N/A (consumer)	Q4 requires 48 GB VRAM or multi-GPU. Single RTX 4090 cannot hold it fully.

Hardware tiers: what fits where

VRAM	Models that fit (Q4)	Example GPUs	Practical note
4 GB	1B–3B (Q4)	GTX 1650, GTX 1050 Ti	Proof-of-concept only. Useful for understanding the workflow, not daily use.
6 GB	3B–4B (Q4), 7B (partial)	RTX 2060, GTX 1060 6GB, RTX 3060 Laptop	Light daily use with small models. 7B models may partially offload to CPU.
8 GB	7B–8B (Q4)	RTX 3070, RTX 4060	Practical starting point. 7B Q4 fits with context headroom. Most common tier.
12 GB	7B (Q8), 13B–14B (Q4)	RTX 3080 12GB, RTX 4070	Good everyday tier. Enables Q8 for 7B and Q4 for 13B-class models.
16 GB	13B (Q8, tight), 30B (Q4, partial offload)	RTX 4080, RTX 4060 Ti 16GB	Solid builder tier. 13B at Q8 (~14 GB) leaves little room for context. 30B at Q4 (~18–20 GB) needs partial CPU offload.
24 GB	30B–32B (Q4), 13B–14B (Q8)	RTX 3090, RTX 4090	High-end consumer tier. 32B Q4 fits well. 13B FP16 (~26 GB) does not fit. 70B requires CPU offload.
48 GB	70B (Q4)	RTX 6000 Ada, A40	Professional workstation. 70B Q4 fits with context headroom.
80 GB+	70B (Q8)	A100 80GB, H100	Data centre class. 70B at Q8 (~74 GB) fits on one 80 GB card. FP16 on 70B (~140 GB) needs multiple GPUs.
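Once you have a total VRAM estimate, mapping it to a tier in the table above is a simple lookup. The sketch below assumes the estimate already includes KV cache and overhead; the function name is illustrative and the tier list is copied from the VRAM column.

```python
from bisect import bisect_left

# VRAM tiers from the table above, in GB, sorted ascending.
VRAM_TIERS_GB = [4, 6, 8, 12, 16, 24, 48, 80]


def smallest_tier_gb(required_gb: float):
    """Return the smallest tier that is >= required_gb, or None if none is large enough."""
    index = bisect_left(VRAM_TIERS_GB, required_gb)
    return VRAM_TIERS_GB[index] if index < len(VRAM_TIERS_GB) else None


for estimate in (4.1, 10.1, 19.0, 40.0):
    print(f"{estimate} GB estimate: {smallest_tier_gb(estimate)} GB tier")
```

A None result means the model needs a multi-GPU setup or CPU offload.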

Apple Silicon: unified memory explained

Apple's M-series chips use a unified memory architecture: CPU and GPU share the same physical memory pool rather than having separate DRAM chips. For local AI, this means:

A MacBook Pro M3 Max with 64 GB of unified memory can load models far larger than a 24 GB NVIDIA GPU allows.
Memory bandwidth is the main constraint, not total capacity. Apple Silicon has high bandwidth (up to 400 GB/s on M3 Max), but that is still well below high-end NVIDIA GPUs such as the RTX 4090 (about 1 TB/s).
Throughput per GB of memory is generally lower than dedicated VRAM — Apple Silicon generates fewer tokens per second per GB than an equivalent NVIDIA setup.
The larger addressable pool is still a real advantage for fitting large models that simply cannot run on consumer NVIDIA GPUs.
Tools like Ollama and LM Studio support Apple Silicon via Metal acceleration on macOS.

What to do when you run out of VRAM

Use a smaller quantization level: switch from FP16 to Q8, or from Q8 to Q4_K_M.
Use a smaller model: a 7B Q4 running at 30 t/s beats a 14B model partially offloaded to CPU at 4 t/s for most interactive tasks.
Reduce context window: lower --ctx-size in llama.cpp, num_ctx in Ollama, or the context slider in LM Studio. Halving the context halves the KV cache budget.
Close VRAM-consuming apps: games, video software, and other ML processes compete for the same GPU memory.
Enable CPU offload deliberately: some runtimes let you specify how many layers run on GPU vs CPU. Partial offload is better than no GPU at all.
Consider a cloud GPU: for occasional large-model inference, renting a GPU via RunPod, Vast.ai, or similar services is often cheaper than upgrading hardware.
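For the context-window option in the list above, you can invert the KV cache formula to find how many tokens fit in the VRAM left over after weights and overhead. This is a rough sketch: the function name is illustrative, sizes are in decimal GB, and the shape values in the example are the Llama 3 8B figures (32 layers, 8 KV heads, head_dim 128).

```python
def max_context_tokens(
    vram_gb: float,
    weights_gb: float,
    overhead_gb: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 for FP16 KV cache, 1 for INT8
) -> int:
    """Largest context length whose KV cache fits in the remaining VRAM (0 if none)."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return max(0, int(free_bytes // bytes_per_token))


# 8 GB card, ~5 GB of weights, 1 GB runtime overhead, Llama 3 8B shape.
print(max_context_tokens(8.0, 5.0, 1.0, 32, 8, 128))
```

Switching the KV cache from FP16 to INT8 halves the bytes per token, which roughly doubles the context that fits.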

Quick hardware planning checklist

Check your GPU VRAM using nvidia-smi or Task Manager → Performance → GPU.
Find your target model's parameter count on its model card or Ollama library page.
Estimate weight VRAM: (params_B × bits / 8) × 1.15.
Add KV cache estimate for your planned context window.
Leave 1–2 GB headroom. If the total exceeds your VRAM, reduce quantization or choose a smaller model.
Test with Ollama or LM Studio. Check actual VRAM use during inference with nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1.

Sources

llama.cpp: GGUF quantization reference
Hugging Face GGUF format overview
Ollama model library (size data per model)
NVIDIA System Management Interface (nvidia-smi)
GQA: Training Generalized Multi-Query Transformer Models (arXiv)

FAQ

What happens if my model is larger than my VRAM?

The runtime offloads layers to system RAM (CPU offload). This works but slows generation dramatically, often to 5 to 20 times slower than full GPU inference. A model that spills well beyond VRAM may produce text at 1–3 tokens per second instead of 20–40.

Does system RAM matter for local LLMs?

Yes, when a model does not fit fully in VRAM. The offloaded layers run from system RAM via the CPU, so fast RAM (DDR5, high-frequency DDR4) and plenty of it (32–64 GB) reduces the speed penalty. For fully GPU-accelerated runs, system RAM matters mainly for the OS and other running apps.

Is Apple Silicon VRAM the same as NVIDIA VRAM?

Apple Silicon uses a unified memory architecture: the CPU and GPU share the same physical memory pool. This means a MacBook Pro M3 Max with 64 GB of unified memory can load much larger models than an NVIDIA GPU with 24 GB of dedicated VRAM. Effective throughput per GB is lower than dedicated VRAM but the larger addressable pool is a real advantage.

Does quantization hurt model quality?

It depends on the model, quantization level, and task. Q4_K_M on a strong 7B model is often acceptable for chat and summarisation but may show degradation on complex reasoning or precise instruction following. Q8 is very close to FP16 for most tasks. Q2 models lose significant quality on most benchmarks. Always test on your actual use case.

Can I run a 70B model on consumer hardware?

At Q4 quantization, a 70B model needs approximately 38–42 GB of VRAM. A single RTX 4090 (24 GB) cannot hold it fully. Options include: a dual-GPU setup, a workstation GPU like the RTX 6000 Ada (48 GB), cloud GPU rental, or CPU offload with fast system RAM and very slow generation speed.

What is KV cache and why does it use VRAM?

KV cache stores the key and value tensors computed during the attention step for each token in the context window. This avoids recomputing them on every generation step. Longer context windows require proportionally more KV cache VRAM. At 32K context with an FP16 cache, a Llama 3 8B-class model needs about 4.3 GB for the KV cache, and models with more layers or KV heads need more.

Next step: check your hardware before choosing a model

Use the local LLM compatibility checker to match your GPU VRAM to specific model sizes and quantization levels with a single tool.
