Hardware · NVIDIA Ada Lovelace · Reviewed June 2026
RTX 4090 for Local LLMs: What 24GB VRAM Actually Unlocks
The RTX 4090 is the current consumer ceiling for local AI. Its 24GB of GDDR6X memory and 1008 GB/s of bandwidth — double what 12GB-class cards offer — change what is possible at the desktop level. It cannot run 70B models at usable quality, but it handles everything below that ceiling with speed and headroom that lower-VRAM cards cannot match.
Editorial review
GPU pricing changes frequently. This page covers the RTX 4090 (24 GB). The RTX 3090 has the same VRAM at lower bandwidth and is covered separately. Verify current pricing before purchasing.
Quick verdict
The RTX 4090 is the strongest single-GPU local AI card available to consumers. The 24GB VRAM ceiling opens three things that 12GB cards cannot do: 7B and 8B models at full FP16 precision, 13B and 14B models at Q8, and 30B and 32B models at Q4.
It is not a 70B card. A 70B model at Q4_K_M needs 38–42GB of VRAM, which exceeds 24GB. At extremely aggressive Q2 quantization, a 70B model barely fits but quality degrades significantly. If 70B models at reasonable quality are your primary use case, you need a workstation GPU with 48GB+ or cloud inference.
For everything up to and including 32B models, the RTX 4090 is the fastest and highest-quality consumer option. The 1008 GB/s memory bandwidth translates to roughly 2× the token generation speed of 12GB-class cards at comparable quantization levels.
What this hardware can run
The 24GB VRAM budget changes the quantization choices available at each model size. The key upgrade over 12GB cards is running 7B and 8B at FP16, 13B and 14B at Q8, and 30B+ models at all.
| Model size | Best quantization | VRAM used | Verdict | Notes |
|---|---|---|---|---|
| 1B–4B | FP16 | 2–8 GB | Comfortable | Full precision on all small models. Very fast generation. |
| 7B–8B | FP16 (key advantage) | ~14–16 GB | Comfortable | FP16 fits with 8–10 GB headroom. This is impossible on 12GB cards. Near-zero quality loss vs training precision. |
| 13B | Q8 | ~14 GB | Comfortable | Q8 fits with ~10 GB headroom. FP16 (~26 GB) exceeds 24 GB. |
| 14B | Q8 | ~15 GB | Comfortable | Qwen 2.5 14B, Phi-4 at Q8 — very close to FP16 quality. Strong daily driver tier. |
| 30B–32B | Q4_K_M | ~18–20 GB | Comfortable | Opens models that require CPU offload on 12GB cards. Fits with 4–6 GB headroom. |
| 70B | Q4_K_M | ~38–42 GB | Does not fit | Q4 exceeds 24 GB. Q2 (~23 GB) barely fits but quality degrades significantly. Use cloud for 70B at usable quality. |
Best model sizes for this card
Three tiers stand out on the RTX 4090:
- 7B and 8B at FP16: The most important unlock vs 12GB cards. FP16 is the highest quality you can run without sacrificing precision — effectively the same as the model at training time. At 1008 GB/s bandwidth, a 7B FP16 model runs at 40–80 tokens per second depending on the model architecture. This is the right tier for daily chat and fast coding assistance.
- 14B at Q8: Near-FP16 quality on a 14B model at around 15GB of VRAM. This is the sweet spot for depth vs speed. Models like Qwen 2.5 14B and Phi-4 at Q8 are strong daily drivers that punch above their parameter count.
- 30B–32B at Q4_K_M: The high-end use case that 12GB cards cannot touch. A 32B model at Q4_K_M uses approximately 20GB of VRAM — comfortable on 24GB with 4GB headroom for KV cache. Good for complex reasoning, long summarisation, and tasks where model depth matters more than generation speed.
Recommended models
- Qwen 3 8B FP16: Pull with
ollama pull qwen3:8b-fp16. Full precision on a strong 8B model. Fast and high quality — the right default for daily use on this card. - Qwen 2.5 14B Q8: Near-lossless quality on a 14B model at about 15GB VRAM. Strong reasoning and instruction following. Run with
ollama pull qwen2.5:14b-q8_0. - Qwen 2.5 32B Q4_K_M: A 32B model at Q4 — the class of model previously limited to workstation GPUs. Run with
ollama pull qwen2.5:32b. - Phi-4 Mini FP16 — Very fast at full precision. Under 4GB VRAM. Good for tools, agents, and fast autocomplete workflows.
- DeepSeek-R1 Distill Qwen 14B Q8: Reasoning-focused distilled model at Q8 quality. Fits in ~15GB VRAM with strong multi-step reasoning capability.
Recommended runtimes
- Ollama — Best starting point. Handles CUDA automatically, one command to pull and run any model. Use the
:fp16and:q8_0tag suffixes to target specific quantizations. - LM Studio — Desktop GUI for browsing and running models. Good for comparing multiple models side by side. Includes a local API server compatible with the OpenAI SDK.
- Open WebUI — Browser-based chat interface over Ollama. Run via Docker. The cleanest way to get a private web UI without writing any frontend code.
- vLLM: For batched inference, production serving, or running multiple requests in parallel. More setup than Ollama but much higher throughput for server-style workloads. The 4090's bandwidth makes it a capable vLLM host for team use.
Best local AI workflows for this card
- High-quality chat: 7B or 8B at FP16 with fast interactive generation. The 4090 makes FP16 chat feel immediate in a way that 12GB cards running Q4 do not.
- Local coding agents: 14B at Q8 for strong code generation at speed. The high bandwidth means fast autocompletion and short latency for agentic tool call loops.
- Batched RAG pipelines: 30B Q4 for complex document reasoning with large retrieval context. The headroom above model weights absorbs the KV cache growth at longer context.
- Complex reasoning: 30B or 32B Q4 models handle multi-step reasoning tasks that overwhelm 7B-class models. Good for structured output, chain-of-thought, and agent planning.
- Model serving for small teams: vLLM on a 4090 can serve a 7B or 14B model to multiple clients simultaneously at acceptable throughput. A viable low-cost private inference server.
What this hardware cannot do well
- 70B models at usable quality: Q4_K_M needs 38–42GB and does not fit. Q2 barely fits but produces noticeably degraded output for most tasks. For 70B at Q4 quality, you need 48GB+ VRAM or cloud inference.
- 13B at FP16: FP16 on 13B needs approximately 26GB, which exceeds 24GB. The practical maximum is Q8 on 13B, which is excellent quality but not FP16.
- Multi-GPU VRAM pooling: The RTX 4090 has no NVLink support. Two 4090s in the same machine provide 48GB of total VRAM but they cannot be automatically combined for a single model without explicit parallelism support in the runtime.
- Power-constrained environments: The RTX 4090 draws up to 450W at full load. It requires a high-quality PSU (850W+ recommended) and good case airflow. Not a card for a compact or low-power workstation.
Upgrade path
The RTX 4090 is the consumer ceiling. The next real step up is workstation or data centre hardware:
- RTX A6000 (48GB): Professional workstation card. 48GB GDDR6 opens 70B at Q4 comfortably. Much lower bandwidth than the 4090 but twice the VRAM. Available new or used.
- Dual RTX 3090 or dual RTX 4090: Two 24GB cards give 48GB of total addressable VRAM with runtimes that support tensor parallelism. Significant setup overhead and requires explicit multi-GPU runtime support (vLLM, tensor parallel llama.cpp builds).
- A100 / H100 (80GB): Data centre class. 70B at Q8 or FP16 is straightforward. Available as cloud instances via RunPod, Lambda, or Vast.ai if you do not want to own the hardware.
For most local AI use cases through 32B models, the RTX 4090 does not have a consumer successor worth waiting for. If 70B models are your target, go to cloud or workstation hardware.
Cloud fallback
When 70B or larger models are needed, cloud GPU inference is often faster and cheaper than buying 48GB+ workstation hardware for occasional use.
- RunPod: On-demand A100 (80GB) and H100 instances. Good for spot workloads and fine-tuning runs.
- Lambda: ML-focused cloud GPU. A100 and H100 instances with straightforward pricing.
- Vast.ai: Marketplace model. Often the cheapest option for short experimental runs on large models.
Related hardware
FAQ
Can the RTX 4090 run 70B models?
Not comfortably. A 70B model at Q4_K_M needs approximately 38–42 GB of VRAM, which exceeds the 4090's 24 GB. At very aggressive Q2 quantization (~23 GB), a 70B model barely fits, but Q2 quality is significantly degraded — output coherence drops noticeably compared to Q4 or Q8. For 70B models at usable quality, you need a multi-GPU workstation, a server GPU like the H100 80GB, or cloud inference.
What is the biggest practical advantage of 24GB VRAM over 12GB for local AI?
Three things: you can run 7B and 8B models at FP16 (full precision, no quantization quality loss), you can run 13B and 14B models at Q8 rather than Q4, and you can run 30B and 32B models at Q4 — which is simply impossible on 12GB cards without painful CPU offload. The jump from 12GB to 24GB is one of the most impactful VRAM upgrades for local AI use.
Is the RTX 4090 worth it over a used RTX 3090 for local AI?
Both have 24GB VRAM, so the model ceiling is identical. The 4090 has roughly double the memory bandwidth (1008 GB/s vs 936 GB/s for the 3090) and faster CUDA compute, which translates to meaningfully faster token generation — typically 30–50% more tokens per second on the same model. If budget is the constraint, a used RTX 3090 is a strong value option. If maximum generation speed matters, the 4090 is the better choice.
Can the RTX 4090 run 13B models at FP16?
No. FP16 for a 13B model requires approximately 26 GB of VRAM, which exceeds the 4090's 24 GB. The best you can do at 24GB is Q8 for 13B models, which needs about 14 GB and is very close to FP16 quality in practice. For FP16 on 13B, you need a 32GB+ GPU like the A100 or a dual-GPU setup.
What is the best local AI workflow for the RTX 4090?
The RTX 4090 excels at batched RAG pipelines, high-speed coding agents, and any workflow where fast token generation matters. Running 30B Q4 models for complex reasoning, 7B FP16 for high-quality chat, and serving multiple model inference requests sequentially are all strong fits. The 1008 GB/s memory bandwidth makes it significantly faster than 12GB-class cards for the same model.
What Ollama command should I use first on the RTX 4090?
With 24GB, you can start bigger than with 12GB cards. Try `ollama pull qwen2.5:32b` for a large Q4 model, or `ollama pull llama3:8b-fp16` for a small model at full precision. For a fast daily driver, `ollama run qwen3:8b` runs at near-FP16 quality at Q8 with plenty of VRAM headroom.
Disclosure
OpenSourcesAI may earn a commission or referral fee from links to hardware retailers, cloud GPU providers, or partner tools on this page. Editorial assessments are produced independently and are not influenced by commercial relationships. Hardware specs are sourced from manufacturer documentation. Model VRAM estimates are derived from GGUF quantization formulas and may vary across runtime versions and model architectures. Verify before making purchasing decisions.
Check specific model fit for the RTX 4090
Enter your VRAM, RAM, and workflow into the compatibility checker to see which models and quantization levels are recommended for your setup.
For builders
Selling AI hardware, cloud GPUs, or workstation solutions?
Sponsor a contextual placement on this page or submit your product for editorial review on OpenSourcesAI. For sponsorships, email sponsors@opensourcesai.com. For submissions or corrections, use the submit page.