Hardware tier · 48GB VRAM · Reviewed June 2026
What Can 48GB VRAM Run? Workstation-Class Local AI
48GB of GPU memory is the threshold at which 70B models fit in a single memory pool at Q4 quantization. At 24GB, a 70B model does not fit without CPU offload. At 48GB, 30B models at Q8 (near-lossless quality) also become possible. This tier covers three paths to 48GB: an NVIDIA RTX A6000 workstation card, two RTX 3090 cards joined by an NVLink bridge, or Apple Silicon with 64GB of unified memory, of which ~48GB is accessible to the GPU.
Model fit at 48GB VRAM
The table applies to any configuration with approximately 48 GB of memory accessible to the GPU, whether that memory sits on one card, is split across two cards, or is part of a unified pool.
| Model size | Best quantization | Memory used | Fits in 48GB? | Notes |
|---|---|---|---|---|
| 1B–14B | FP16 | 2–28 GB | Yes | Full precision on all models up to 14B. Large VRAM headroom. |
| 30B–32B | FP16 | ~60–64 GB | No (Q8 fits) | FP16 on 30B exceeds 48 GB. Q8 on 30B (~34 GB) fits with headroom. |
| 30B–32B | Q8 | ~34 GB | Yes | Near-lossless quality on 30B models. Impossible at 12 or 24 GB. |
| 70B | Q4_K_M | ~38–42 GB | Yes | The headline unlock. Fits with 6–10 GB headroom. 70B at interactive quality. |
| 70B | Q8 | ~74 GB | No | Q8 on 70B requires ~74 GB — exceeds 48 GB. Needs 80 GB+ (H100, A100 80GB) or cloud. |
| 120B+ | Q4 | 70 GB+ | No | Frontier model sizes. Requires multi-GPU nodes or cloud. |
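The memory figures in the table can be approximated as parameter count times bits per weight. The sketch below is a rough estimate under stated assumptions, not a measurement: the bits-per-weight value for Q4_K_M (~4.8) is an assumed average, `weight_gb` and `fits` are hypothetical helper names, and the 4 GB reserve for KV cache and runtime overhead is an illustrative choice.

```python
# Approximate bits per weight. FP16 is 16 by definition; Q8_0 stores 32 int8
# weights plus one FP16 scale per block (34 bytes / 32 weights = 8.5 bits).
# The Q4_K_M value is an ASSUMED average, since that format mixes block types.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}


def weight_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str,
         pool_gb: float = 48.0, reserve_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed reserve fit in the memory pool."""
    return weight_gb(params_billion, quant) + reserve_gb <= pool_gb


if __name__ == "__main__":
    rows = [(14, "FP16"), (32, "FP16"), (32, "Q8_0"), (70, "Q4_K_M"), (70, "Q8_0")]
    for size, quant in rows:
        verdict = "fits" if fits(size, quant) else "does not fit"
        print(f"{size}B {quant:7s} ~{weight_gb(size, quant):5.1f} GB  {verdict}")
```

Actual file sizes vary by model architecture and by how a given quantizer mixes block types, so treat the output as a first-pass sanity check against the table rather than a replacement for it.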
What 48GB unlocks over 24GB
- 70B models at Q4: The critical unlock. A 70B model at Q4_K_M needs ~38–42 GB, which fits in a 48 GB pool with 6–10 GB of headroom. No 24 GB GPU can do this without CPU offload. Q4 is the highest-quality 70B quantization in the table above that fits at this tier.
- 30B and 32B at Q8: Q8 on a 32B model needs ~34 GB — impossible at 24 GB, comfortable at 48 GB. Near-lossless quality on a 30B model is a meaningful step above 24 GB where Q4 is the ceiling.
- 14B at FP16 with full context headroom: 14B FP16 uses ~28 GB, leaving 20 GB of headroom for KV cache at large context windows. Very long context at FP16 precision on a 14B model becomes practical.
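To see how far 20 GB of headroom goes, the KV cache can be sized as two tensors (keys and values) per layer per token. The sketch below assumes a hypothetical 14B-class model shape (48 layers, 8 KV heads, head dimension 128) and an FP16 cache; these dimensions are illustrative assumptions rather than figures from this guide, and `kv_cache_gb` is a hypothetical helper name.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes).

    The leading factor of 2 covers keys and values; bytes_per_elem=2 means FP16.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_elem)
    return total_bytes / 1e9


# ASSUMED shape for a 14B-class model with grouped-query attention.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        size = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx)
        print(f"{ctx:>7} tokens: {size:5.1f} GB")
```

Under these assumed dimensions a 32K-token context needs roughly 6.4 GB of cache, well inside the 20 GB of headroom, while a 128K-token context needs roughly 25.8 GB and would not fit without a smaller or quantized cache. Models without grouped-query attention have more KV heads and proportionally larger caches.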
Hardware paths to 48GB
| Hardware | Bandwidth | Type | Notes |
|---|---|---|---|
| NVIDIA RTX A6000 48GB | 768 GB/s | Professional workstation GPU | Single-card 48 GB GDDR6. Supports NVLink for 96 GB dual-card setups. Professional tier pricing. |
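| Dual RTX 3090 NVLink | 936 GB/s per card; ~112 GB/s over the NVLink bridge | Consumer dual-GPU (complex setup) | Two RTX 3090 cards (24 GB each) linked by an NVLink bridge provide 48 GB in total. The runtime splits the model across both cards rather than addressing one physical pool. Requires NVLink bridge, 1000 W+ PSU, and runtime-level multi-GPU support. See RTX 3090 guide for full setup notes. |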
| Apple Silicon 64GB Unified Memory | ~400 GB/s | Mac unified memory (macOS only) | Of the 64 GB pool, ~48 GB (~75%) is accessible for AI. Metal only, with no CUDA runtimes. Single-device simplicity. See full Apple Silicon guide. |
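Memory bandwidth matters because token generation on a dense model typically reads every weight once per token, so bandwidth divided by model size gives a rough upper bound on decode speed. The sketch below applies that rule of thumb to two bandwidth figures from the table; it is an idealized ceiling under the assumption of a purely bandwidth-bound workload, `decode_ceiling_tok_s` is a hypothetical helper name, and the 40 GB model size is the midpoint of the ~38–42 GB range quoted for 70B at Q4_K_M.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Idealized tokens/second if each token requires one full read of the weights."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 40.0  # 70B at Q4_K_M, midpoint of the ~38-42 GB range (assumption)

if __name__ == "__main__":
    paths = [("RTX A6000 48GB", 768.0), ("Apple Silicon 64GB", 400.0)]
    for name, bandwidth in paths:
        ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
        print(f"{name:20s} <= {ceiling:4.1f} tok/s (theoretical ceiling)")
```

Real throughput lands below this ceiling because of compute cost, KV cache reads, and runtime overhead, so the numbers are best used to compare paths relative to one another.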
Choosing a path to 48GB
- RTX A6000 48GB: Single card, clean setup, full CUDA ecosystem (Ollama, LM Studio, vLLM, TGI). Professional pricing. Best choice if you want the workstation path with no architectural complexity.
- Dual RTX 3090 NVLink: Two consumer cards with 24 GB each, for 48 GB in total. Lower cost than an A6000 in many markets but significantly more complex: it requires an NVLink bridge, a high-wattage PSU, a runtime that can split a model across two GPUs, and careful thermal management. Not a beginner setup. See the RTX 3090 guide for full NVLink requirements.
- Apple Silicon 64GB: The simplest consumer path to ~48 GB of accessible AI memory on macOS. No CUDA runtimes, but Ollama, LM Studio, and llama.cpp all support Metal acceleration. Lower tokens-per-second than the A6000 or dual 3090 setup for models under 32B, but the easiest entry to 70B at Q4 without professional GPU pricing.
What 48GB still cannot do
- 70B at Q8: Q8 on a 70B model requires ~74 GB. Needs 80 GB+ (NVIDIA A100 80GB, H100 80GB) or cloud inference.
- 30B at FP16: FP16 on 30B requires ~60 GB. Exceeds 48 GB. Q8 is the practical ceiling for 30B+ at this tier.
- Multiple large models simultaneously: 70B at Q4 uses most of the 48 GB pool. Running a second model alongside it is not practical at this tier.
Cloud fallback for above-48GB workloads
- RunPod: A100 80GB and H100 80GB instances. A single 80 GB card covers 70B at Q8 (~74 GB); 70B at FP16 (~140 GB) needs a multi-GPU instance.
- Lambda: ML-focused cloud GPU compute with A100 and H100 instances.
- Vast.ai: Marketplace pricing — often lowest cost for occasional large-model inference jobs.