Hardware tier · 24GB VRAM · Reviewed June 2026

What Can 24GB VRAM Run? Local AI at the 24GB Tier

24GB is the current consumer ceiling for discrete GPU VRAM — available in the RTX 3090 and RTX 4090. It unlocks three things the 12GB tier cannot do: 7B and 8B models at full FP16 precision, 13B and 14B models at Q8, and 30B–32B models at Q4. What it cannot do is run 70B models — that requires 48GB+ of VRAM or Apple Silicon unified memory.

Model fit at 24GB VRAM

The table below applies to any GPU with 24GB of dedicated VRAM. The RTX 3090 and RTX 4090 share this ceiling — bandwidth differences affect generation speed, not which models fit.

| Model size | Best quantization | VRAM used | Fits in 24GB? | Notes |
|---|---|---|---|---|
| 1B–4B | FP16 | 2–8 GB | Yes | Full precision on all small models. |
| 7B–8B | FP16 | ~14–16 GB | Yes | Key upgrade over 12GB. Full precision with 8–10 GB of headroom. |
| 13B | Q8 | ~14 GB | Yes | Near-lossless Q8 quality with ~10 GB headroom. FP16 (~26 GB) does not fit. |
| 14B | Q8 | ~15 GB | Yes | Qwen 2.5 14B, Phi-4 at Q8. Very close to FP16 quality. |
| 30B–32B | Q4_K_M | ~18–20 GB | Yes | The headline 24GB unlock. Impossible on 12GB cards at interactive speed. |
| 70B | Q4_K_M | ~38–42 GB | No | Exceeds 24 GB. Requires 48GB+ (workstation GPU, dual-3090 NVLink, or Apple Silicon 64GB). |
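
The VRAM figures in the table follow from simple arithmetic: billions of parameters times average bits per weight, divided by 8, gives gigabytes of weights. The sketch below reproduces that estimate. The function names are hypothetical, and the bits-per-weight values (roughly 8.5 for Q8 and 4.85 for Q4_K_M) are approximate assumptions rather than exact figures. KV cache and runtime overhead are not included, so real usage will be somewhat higher.

```shell
#!/usr/bin/env bash
# Rough weight-memory estimate: params (billions) x bits per weight / 8 = GB.
# Bits-per-weight inputs are approximate assumptions for quantized formats.
# KV cache and runtime overhead are NOT included.

weights_gb() {
  # $1 = parameters in billions, $2 = average bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

fits_24gb() {
  # Prints "yes" when the weights alone are under 24 GB, otherwise "no".
  awk -v p="$1" -v b="$2" 'BEGIN { if (p * b / 8 < 24) print "yes"; else print "no" }'
}

weights_gb 7 16      # 7B at FP16
weights_gb 13 8.5    # 13B at Q8 (assumed ~8.5 bits per weight)
weights_gb 32 4.85   # 32B at Q4_K_M (assumed ~4.85 bits per weight)
fits_24gb 70 4.85    # 70B at Q4_K_M
```

A "yes" from this check only means the weights fit. Leave a few GB of headroom for context length and the runtime itself.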

What 24GB VRAM unlocks over 12GB

  • 7B and 8B at FP16: The biggest practical upgrade. Full precision on 7B and 8B models — no quantization quality tradeoff. 7B FP16 uses ~14GB with 10GB of headroom. 8B FP16 uses ~16GB with 8GB of headroom.
  • 13B and 14B at Q8: On 12GB cards, 13B models require Q4 quantization. On 24GB, they run at Q8 — very close to FP16 quality. This is a meaningful quality improvement for instruction following, coding, and reasoning tasks.
  • 30B–32B at Q4: The high-end unlock. A 32B model at Q4 uses approximately 20GB and fits with 4GB headroom. This model class requires CPU offload on 12GB cards and runs at interactive speed on 24GB GPUs.

What 24GB VRAM still cannot do

  • 70B models at usable quality: Q4 on 70B needs ~38–42GB. Even at the aggressive Q2 quantization level (~23GB), quality degrades significantly. 70B at Q4 requires Apple Silicon 64GB unified memory, a dual-RTX-3090 NVLink setup, or cloud inference.
  • 13B or 14B at FP16: FP16 on 13B needs ~26GB. FP16 on 14B needs ~28GB. Both exceed 24GB. Q8 is the practical maximum quality for 13B+ on this tier.

GPUs at the 24GB tier

| GPU | Bandwidth | Architecture | Notes |
|---|---|---|---|
| RTX 3090 24GB | 936 GB/s | Ampere | Used market value. NVLink supported (last consumer card with it). ~8% slower than 4090. |
| RTX 4090 24GB | 1008 GB/s | Ada Lovelace | Fastest consumer GPU. Maximum bandwidth and Ada Lovelace compute efficiency. |

Both cards have the same 24GB ceiling. The RTX 4090's bandwidth advantage (1008 vs 936 GB/s) produces roughly 8–10% more tokens per second at the same model size. On the used market, the RTX 3090 often represents strong value if the bandwidth difference is not a priority.
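
The bandwidth argument can be checked with a back-of-the-envelope calculation. The sketch below assumes token generation is purely memory-bound, so each generated token reads all model weights once and the ceiling is bandwidth divided by model size. That is a simplification. Real throughput is lower and also depends on compute, the runtime, and context length. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Memory-bound ceiling on generation speed: bandwidth (GB/s) / model size (GB).
# Assumes every token reads all weights once; treat results as upper bounds.

max_tokens_per_sec() {
  # $1 = memory bandwidth in GB/s, $2 = model size in GB
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.0f\n", bw / gb }'
}

bandwidth_gain_pct() {
  # $1 = faster card GB/s, $2 = slower card GB/s
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

max_tokens_per_sec 1008 14   # RTX 4090, 7B FP16 (~14 GB)
max_tokens_per_sec 936 14    # RTX 3090, same model
bandwidth_gain_pct 1008 936  # gain from bandwidth alone, in percent
```

Bandwidth alone accounts for a gain just under 8%. Any gap beyond that would come from compute and architecture differences rather than memory speed.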

24GB vs Apple Silicon 64GB

Apple Silicon with 64GB of unified memory is the only consumer alternative that raises the model ceiling beyond 32B. The tradeoff cuts both ways: it fits larger models but generates tokens more slowly.

| Metric | 24GB NVIDIA (RTX 4090) | Apple Silicon 64GB |
|---|---|---|
| Max model (comfortable) | 32B Q4 | 70B Q4 |
| 7B generation speed | ~40–80 t/s | ~15–25 t/s |
| 14B FP16 | No (~28 GB needed) | Yes (~28 GB fits) |
| CUDA runtimes (vLLM) | Yes | No (Metal only) |
| OS | Windows / Linux | macOS only |

If 70B models or 14B FP16 are hard requirements, Apple Silicon 64GB is the right architecture. If fast inference on 7B–32B models is the priority, a 24GB NVIDIA card is faster for those sizes.

When to upgrade beyond 24GB

The jump from 24GB to the next meaningful tier (48GB+) is large in both cost and complexity. It is worth considering only if:

  • 70B models at Q4 quality are a regular requirement — not occasional curiosity.
  • 13B+ at FP16 is a hard quality requirement — and Q8 quality is not acceptable.
  • Multi-user serving needs 32B+ models — requiring workstation GPUs (RTX A6000 48GB) or a multi-GPU node.

For most local AI workflows up to 32B models, 24GB is a practical consumer ceiling, and there is little reason to upgrade beyond it.

Recommended first setup at 24GB

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull an 8B model at FP16 (~16 GB), the key unlock vs 12GB
ollama pull qwen3:8b-fp16
ollama run qwen3:8b-fp16

# Or go straight to 32B at Q4 — impossible on 12GB
ollama pull qwen3:32b
ollama run qwen3:32b
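
After loading a model, it is worth confirming that it actually sits in VRAM rather than spilling to CPU. The sketch below is one way to check, assuming an NVIDIA driver with `nvidia-smi` on the PATH and a running Ollama server. The helper `headroom_mib` is a hypothetical name introduced here. It expects the "used, total" CSV output that `nvidia-smi` prints with `nounits`.

```shell
#!/usr/bin/env bash
# Report free VRAM per GPU and list the models Ollama currently has loaded.

headroom_mib() {
  # Reads "used, total" lines (MiB) on stdin and prints free MiB per GPU.
  awk -F', *' '{ print $2 - $1 }'
}

# Query used and total memory in MiB, one line per GPU, without header or units.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | headroom_mib
fi

# Show loaded models; the PROCESSOR column indicates the GPU vs CPU split.
if command -v ollama >/dev/null 2>&1; then
  ollama ps || true
fi
```

If `ollama ps` reports a CPU/GPU split instead of a full GPU load, the model plus its context did not fit, and a smaller quantization or a shorter context is likely needed.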
