Hardware tier · 12GB VRAM · Reviewed June 2026

What Can 12GB VRAM Run? Local AI at the 12GB Tier

12GB of GPU VRAM is the most common capable tier for local AI. It is enough to run 7B and 8B models at Q8 and 13B–14B models at Q4, which covers the majority of practical open-weight use cases. It cannot hold 30B+ models in VRAM, so those run only with CPU offload at non-interactive speeds, and it cannot run FP16 at the 7B size class or above. This guide covers exactly what fits, which GPUs offer this tier, and when 24GB makes sense.

Model fit at 12GB VRAM

The table below applies to any GPU with 12GB of VRAM. Individual cards differ in token generation speed (memory bandwidth), but the model ceiling is the same across all 12GB GPUs.

Model size | Best quantization | VRAM used | Fits in 12GB? | Notes
1B–4B | FP16, Q8, or Q4 | 0.7–8 GB | Yes | Any quantization. Good for fast tools and experimentation.
7B | Q8 (best on 12GB) | ~7.7 GB | Yes | Q8 fits with ~4 GB headroom. FP16 requires ~14 GB and does not fit.
8B | Q8 | ~8.7 GB | Yes | Llama 3 8B, Qwen3 8B. Q8 fits with ~3 GB headroom.
13B | Q4_K_M | ~7.9 GB | Yes | Q4 fits. Q8 (~14 GB) does not fit at 12GB.
14B | Q4_K_M | ~8.4 GB | Yes | Qwen 2.5 14B, Phi-4. Q4 fits with ~3.5 GB headroom.
30B–32B | Q4 | ~18–20 GB | CPU offload only | Exceeds 12 GB. RAM offload runs at 1–5 t/s. Needs 24GB+ for interactive use.
70B | Q4 | ~38–42 GB | No | Far exceeds 12 GB. Needs cloud or workstation hardware.
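
The figures in the table can be roughly cross-checked with simple arithmetic: weight size in GB is about parameters (in billions) times bits per weight, divided by 8. The sketch below is a minimal illustration, not a precise sizing tool. The function name estimate_weights_gb is hypothetical, and the bits-per-weight values for Q8_0 (8.5) and Q4_K_M (about 4.85) are assumptions based on common GGUF quantization layouts. It counts weights only, so real usage is somewhat higher once context (KV cache) and runtime overhead are added.

```shell
# Estimate the size of model weights in GB (weights only, no KV cache).
# Usage: estimate_weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 7 16      # 7B at FP16  -> 14.0
estimate_weights_gb 8 16      # 8B at FP16  -> 16.0
estimate_weights_gb 13 8.5    # 13B at Q8_0, assuming 8.5 bits per weight
estimate_weights_gb 13 4.85   # 13B at Q4_K_M, assuming ~4.85 bits per weight
```

Comparing the result against 12 GB, with a few GB left over for context, gives a quick first answer before downloading a model.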

What 12GB VRAM enables

  • 7B and 8B at Q8: The best quality level available at 12GB for these model sizes. Q8 is very close to FP16 quality. Runs at 15–40 tokens per second depending on the GPU.
  • 14B at Q4_K_M: A larger model at reduced precision. Good for tasks where the breadth of model knowledge matters more than quantization fidelity.
  • Fast 4B models at any quant: FP16 on 4B models fits well under 12GB. Very fast inference for tools, agents, and autocomplete.

What 12GB VRAM cannot do

  • 7B or 8B at FP16: FP16 on 7B needs ~14GB. FP16 on 8B needs ~16GB. Both exceed 12GB. Q8 is the best available quality tier at this VRAM level.
  • 13B or 14B at Q8: Q8 on 13B requires ~14GB. Q8 on 14B requires ~15GB. Both exceed 12GB. Q4 is the best available for 13B+ on this tier.
  • 30B+ at interactive speed: Q4 on 30B models needs 18–20GB. CPU offload is possible but generates at 1–5 tokens per second — not suitable for interactive chat.

GPUs at the 12GB tier

All 12GB GPUs have the same model ceiling. The meaningful difference between them is memory bandwidth, which determines how many tokens per second you get at each model size.

GPU | Bandwidth | Architecture | Notes
RTX 3060 12GB | 360 GB/s | Ampere | Budget entry. Same model ceiling, lower bandwidth (~30% fewer t/s than 4070 Ti).
RTX 4070 Ti 12GB | 504 GB/s | Ada Lovelace | Prosumer. Higher bandwidth than the RTX 3060 and a newer architecture.
RTX 3080 12GB | 912 GB/s | Ampere | Highest bandwidth at 12GB. Less common than the 3090 but a strong used-market option.
RTX 4070 12GB | 504 GB/s | Ada Lovelace | Same bandwidth as the 4070 Ti at the 12GB tier. Strong for general local AI use.

Rule of thumb: more bandwidth = more tokens per second for the same model. The RTX 4070 Ti at 504 GB/s generates roughly 30–40% more tokens per second than the RTX 3060 at 360 GB/s on the same model and quantization level.
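
The rule of thumb can be turned into a quick comparison between two cards. The sketch below assumes generation speed scales linearly with memory bandwidth, which is an idealization: real throughput also depends on compute, drivers, and the inference runtime. The function name bandwidth_gain_pct is hypothetical.

```shell
# Percent more tokens/s expected from GPU A over GPU B, assuming speed
# scales linearly with memory bandwidth (an idealized upper estimate).
# Usage: bandwidth_gain_pct BANDWIDTH_A_GBPS BANDWIDTH_B_GBPS
bandwidth_gain_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a / b - 1) * 100 }'
}

bandwidth_gain_pct 504 360   # RTX 4070 Ti vs RTX 3060 -> 40
bandwidth_gain_pct 912 360   # RTX 3080 12GB vs RTX 3060 -> 153
```

The first result sits at the top of the 30–40% range quoted above, which is what you would expect from a ratio that ignores every other bottleneck.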

When to upgrade to 24GB

The 24GB tier unlocks three things the 12GB tier cannot do:

  • 7B and 8B at FP16: Unquantized FP16 inference, with no quality loss from quantization.
  • 13B and 14B at Q8: Near-lossless quality on larger models.
  • 30B and 32B at Q4: These models need 18–20GB of VRAM, which a 12GB card cannot hold without falling back to slow CPU offload.

If your daily workflows fit within 7B Q8 or 14B Q4, the 12GB tier is sufficient. The upgrade to 24GB is most justified when you regularly hit the 13B Q4 quality ceiling and want Q8, or when 30B model depth is a hard requirement.
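
The decision can be sketched as a small check: take the VRAM figure for the model and quantization you want, add a reserve for context and runtime overhead, and see which tier it lands in. The function name pick_tier and the default reserve of 1.5 GB are illustrative assumptions, not measured values. Long contexts need a larger reserve.

```shell
# Print the smallest VRAM tier that holds a model needing REQUIRED_GB,
# keeping RESERVE_GB free for context (KV cache) and runtime overhead.
# RESERVE_GB defaults to 1.5, an illustrative value only.
# Usage: pick_tier REQUIRED_GB [RESERVE_GB]
pick_tier() {
  awk -v need="$1" -v reserve="${2:-1.5}" 'BEGIN {
    if (need + reserve <= 12)      print "12GB"
    else if (need + reserve <= 24) print "24GB"
    else                           print "more than 24GB"
  }'
}

pick_tier 8.7   # 8B at Q8       -> 12GB
pick_tier 14    # 13B at Q8      -> 24GB
pick_tier 20    # 30B-32B at Q4  -> 24GB
pick_tier 40    # 70B at Q4      -> more than 24GB
```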

Recommended first setup

On any 12GB GPU, start with Ollama and a 7B or 8B model at Q4_K_M to verify the stack works, then upgrade to Q8 for better quality:

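# Install Ollama (this script targets Linux; on macOS, install the desktop app instead)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull an 8B model at the default Q4 quantization first (smaller download)
ollama pull qwen3:8b

# Run it once to confirm the model loads and generates on the GPU
ollama run qwen3:8b

# Once working, try Q8 for better quality (~8.7 GB VRAM)
ollama pull qwen3:8b-q8_0
ollama run qwen3:8b-q8_0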
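
Once a model is loaded, it can be worth confirming how much VRAM is left, since a model that spills into system RAM will run far slower. The sketch below assumes an NVIDIA GPU with nvidia-smi installed. The helper name free_vram_mib is hypothetical; it only parses the CSV that nvidia-smi prints, so it can be tried with a sample row when no GPU is available. Running ollama ps alongside it should also show whether the loaded model is placed on the GPU or split with the CPU.

```shell
# Print free VRAM in MiB for each GPU, given "total, used" CSV rows on stdin.
# Intended input on a machine with an NVIDIA driver:
#   nvidia-smi --query-gpu=memory.total,memory.used \
#              --format=csv,noheader,nounits | free_vram_mib
free_vram_mib() {
  awk -F', *' 'NF >= 2 { print $1 - $2 }'
}

# Sample row instead of a live GPU: 12288 MiB total, 9100 MiB in use.
echo "12288, 9100" | free_vram_mib   # -> 3188
```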

Check your specific GPU against models

Enter your exact VRAM amount and workflow into the compatibility checker to get model recommendations matched to your hardware.