Hardware · CPU-only inference · Reviewed June 2026

Running Local LLMs Without a GPU: CPU-Only Local AI

Q: Can you run a local LLM without a GPU?

Yes. Tools like Ollama and llama.cpp fall back to CPU inference automatically when no compatible GPU is detected, and any GGUF-format model that fits in system RAM will run on a CPU. The tradeoff is speed: CPU inference is typically 5–20x slower than GPU inference for the same model and quantization level. Small models (1B–3B at Q4) can be interactive on a modern CPU. Larger models (7B+) become slow for chat but are usable for batch tasks.

Q: How much RAM do I need to run a local LLM without a GPU?

16 GB of system RAM is the practical minimum for a 7B model at Q4 (which uses ~4–5 GB). The rule of thumb is: memory in bytes = (parameters × bits per weight / 8) + ~1.5 GB overhead. By this rule a 7B model at Q4 needs about 5 GB (3.5 GB of weights plus overhead) and a 13B model at Q4 about 8 GB. Add your OS overhead (~2–4 GB) to determine whether your RAM budget is sufficient.
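As a rough worked example of the rule above, the shell sketch below evaluates it for two model sizes. The helper name `model_ram_gb` is made up for illustration, and the 1.5 GB overhead is this article's estimate rather than a measured value; real Q4 GGUF files store slightly more than 4 bits per weight, so treat the results as ballpark figures.

```shell
# Rule of thumb from the text:
#   GB = params_in_billions * bits_per_weight / 8 + 1.5 (overhead estimate)
# model_ram_gb is a hypothetical helper name, not part of any tool.
model_ram_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 + 1.5 }'
}

model_ram_gb 7 4     # 7B at Q4  -> prints 5.0
model_ram_gb 13 4    # 13B at Q4 -> prints 8.0
```

Adding the 2–4 GB the operating system needs on top of these figures gives the total RAM budget.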

Q: Which CPU is best for running local LLMs?

For x86 CPUs (Intel/AMD), look for AVX2 support at minimum. It is present in almost all CPUs made since 2015. AVX-512 support gives a meaningful speed boost on compatible Intel Xeon chips, a few consumer Intel generations such as 11th gen Rocket Lake (Intel's 12th gen and later consumer chips do not expose it), and AMD Zen 4 and later. For the fastest inference without a discrete GPU, Apple Silicon (M1, M2, M3, M4 series) is significantly faster than typical desktop x86 CPUs because of its high memory bandwidth and unified memory architecture. See the Apple Silicon guide for details.

Q: What is the best model for CPU-only inference?

For interactive chat on a CPU, 1B–3B models at Q4 offer the best balance of quality and speed. Phi-4 Mini (3.8B) and Qwen 3 1.7B are strong options. For batch tasks where generation speed is less critical, 7B at Q4 provides substantially better output quality while remaining feasible on 16 GB+ RAM.

You do not need a GPU to run a local LLM. Tools like Ollama and llama.cpp automatically use CPU inference if no compatible GPU is found. What you give up is speed — CPU inference is typically 5–20x slower than an equivalent discrete GPU for the same model. Small models (1B–3B at Q4) can be genuinely interactive on a modern CPU. Larger models (7B+) become slow for chat but remain usable for batch tasks where generation speed is secondary.

What CPU-only inference can run

On CPU, system RAM replaces VRAM as the memory ceiling. The same model size formulas apply — the difference is that system RAM is slower to read than GPU VRAM, which is why tokens-per-second are so much lower.

Model size	Quantization	RAM needed	Typical speed	Verdict
1B–3B	Q4	~2–3 GB	5–20 t/s	Practical
7B	Q4	~4–5 GB	2–8 t/s	Slow but usable
13B	Q4	~7–8 GB	1–4 t/s	Very slow
14B	Q4	~8–9 GB	1–3 t/s	Very slow
30B–32B	Q4	~18–20 GB	<1 t/s	Not recommended for chat

Speed figures are approximate for a modern Intel or AMD desktop CPU with AVX2 support. Actual performance varies widely by CPU model, RAM speed (DDR4 vs DDR5), and number of memory channels. Apple Silicon CPUs run significantly faster than x86 at equivalent model sizes — see the Apple Silicon guide.

Why CPU inference is slow

LLM token generation is memory-bandwidth-bound. At each generation step, essentially all of the model's weights must be read from memory into the compute units. Memory bandwidth sets how fast this can happen, so more bandwidth means more tokens per second.

System RAM bandwidth (DDR5): approximately 50–100 GB/s
RTX 3060 12GB GDDR6: 360 GB/s — roughly 4–7x faster
RTX 4090 24GB GDDR6X: 1008 GB/s — roughly 10–20x faster

This is the fundamental bottleneck. A 7B model at Q4 fits easily in 16 GB of system RAM, but reading those weights through around 50 GB/s of dual-channel DDR4 bandwidth yields 2–4 tokens per second instead of the 30–50 a discrete GPU delivers.
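A back-of-the-envelope ceiling follows from this reasoning: if every generated token requires reading all the weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below computes that ceiling. The helper name `max_tps` is hypothetical, and the inputs (a 4 GB model, 50 GB/s for dual-channel DDR4) are illustrative assumptions. Measured speeds land well below the ceiling because compute, cache misses, and KV-cache reads all add cost.

```shell
# Theoretical upper bound on generation speed:
#   tokens/s <= memory bandwidth (GB/s) / model size (GB)
# max_tps is a hypothetical helper name for this sketch.
max_tps() {
  LC_ALL=C awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

max_tps 50 4      # dual-channel DDR4, 4 GB model -> prints 12.5
max_tps 360 4     # RTX 3060 12GB                 -> prints 90.0
max_tps 1008 4    # RTX 4090 24GB                 -> prints 252.0
```

The ratios between these ceilings matter more than the absolute numbers: they track the bandwidth gaps listed above.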

RAM requirements for CPU inference

16 GB system RAM: Minimum for 7B at Q4 (~4–5 GB model). Leaves ~10 GB for OS and other applications.
32 GB system RAM: Comfortable for 7B and 13B at Q4. Better multitasking while a model is loaded.
64 GB system RAM: Can load 30B at Q4 (~18–20 GB), though generation will be very slow (<1 t/s).

RAM speed matters on CPU inference. DDR5 and dual-channel memory configurations produce higher effective bandwidth than DDR4 single-channel setups. If you are buying RAM specifically for local AI use, dual-channel DDR5 is the better choice.
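To see why channel count and transfer rate matter, theoretical peak bandwidth can be estimated as transfer rate (MT/s) × 8 bytes per 64-bit channel × number of channels. The sketch below applies that formula. The helper name `ram_bandwidth_gbs` is hypothetical, the memory speeds are common examples rather than recommendations, and sustained bandwidth in practice is lower than the theoretical peak.

```shell
# Peak bandwidth in GB/s = MT/s * 8 bytes per channel * channels / 1000
# ram_bandwidth_gbs is a hypothetical helper name for this sketch.
ram_bandwidth_gbs() {
  LC_ALL=C awk -v mts="$1" -v ch="$2" 'BEGIN { printf "%.1f\n", mts * 8 * ch / 1000 }'
}

ram_bandwidth_gbs 3200 1    # DDR4-3200, single channel -> prints 25.6
ram_bandwidth_gbs 3200 2    # DDR4-3200, dual channel   -> prints 51.2
ram_bandwidth_gbs 5600 2    # DDR5-5600, dual channel   -> prints 89.6
```

Going from single-channel DDR4 to dual-channel DDR5 more than triples the theoretical peak, which is why the memory configuration deserves attention before the CPU model does.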

CPU requirements

AVX2 support (effectively required): llama.cpp and Ollama rely on AVX2 SIMD instructions for fast quantized inference on x86. Builds without AVX2 exist but run much slower. Almost all CPUs made since 2015 support AVX2. Check with grep avx2 /proc/cpuinfo on Linux or CPU-Z on Windows.
AVX-512 (optional speedup): Intel Xeon processors and a few consumer Intel generations (such as 11th gen Rocket Lake) support AVX-512, which can give a meaningful speed increase. Intel's 12th gen and later consumer chips do not expose it. AMD Zen 4 and later (Ryzen 7000+) also support AVX-512.
Core count: llama.cpp can parallelize across CPU cores. More cores help with prefill (processing the input prompt) but the memory bandwidth bottleneck dominates token generation speed regardless of core count.
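The flag check described above can be wrapped in a small script. This is a minimal sketch for Linux on x86: the function name `simd_level` is hypothetical, and `avx512f` (the AVX-512 foundation flag) is used here as the indicator for AVX-512 support. On other platforms the /proc/cpuinfo step is skipped.

```shell
# Classify the best SIMD level from a space-separated list of CPU flags.
# simd_level is a hypothetical helper name for this sketch.
simd_level() {
  case " $1 " in
    *" avx512f "*) echo "avx512" ;;
    *" avx2 "*)    echo "avx2" ;;
    *)             echo "none" ;;   # neither AVX2 nor AVX-512 found
  esac
}

# Linux only: read the flags line of the first core, if the file exists.
if [ -r /proc/cpuinfo ]; then
  flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
  echo "Best SIMD level: $(simd_level "$flags")"
fi
```

A result of "none" on an x86 machine suggests CPU inference will be impractically slow even for small models.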

Best models for CPU inference

Qwen 3 1.7B Q4: ollama pull qwen3:1.7b. Fast enough for interactive chat on any modern CPU. The recommended starting point.
Phi-4 Mini Q4 (3.8B): ollama pull phi4-mini. Strong reasoning in a small package. Good for coding assistance at CPU-feasible speeds.
Qwen 3 4B Q4: ollama pull qwen3:4b. Higher quality than 1.7B with still-acceptable speed on CPUs with 16 GB+ RAM.
Qwen 2.5 7B Q4 (batch only): ollama pull qwen2.5:7b. For non-interactive use — document summarisation, structured extraction, occasional queries. 2–4 t/s is tolerable if you are not waiting for a live response.

Runtimes for CPU inference

Ollama — Automatically detects whether a GPU is available and falls back to CPU inference. No configuration needed. The easiest path for CPU-only machines.
LM Studio — Desktop GUI with explicit CPU inference mode. Good for browsing models and comparing quality without a terminal.
llama.cpp (CPU-native build): Compile without GPU backends for the most efficient CPU inference. Exposes -t (threads) and --n-gpu-layers 0 flags for explicit CPU-only mode.
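A CPU-only llama.cpp run using the two flags mentioned above might look like the following sketch. It assumes the binary is called `llama-cli` (the name in recent llama.cpp builds; older builds used `main`). The `LLAMA_BIN` and `THREADS` variables, the function name `run_cpu_only`, and the model path in the usage comment are all illustrative, not part of llama.cpp.

```shell
# Binary to invoke; override LLAMA_BIN if your build uses a different name.
LLAMA_BIN="${LLAMA_BIN:-llama-cli}"

# Run a single prompt on the CPU only.
#   $1 = path to a GGUF model file, $2 = prompt text
# THREADS defaults to the number of online logical CPUs; setting it to the
# number of physical cores is often a better starting point.
run_cpu_only() {
  "$LLAMA_BIN" -m "$1" \
    -t "${THREADS:-$(getconf _NPROCESSORS_ONLN)}" \
    --n-gpu-layers 0 \
    -p "$2"
}

# Example (hypothetical model path):
#   THREADS=8 run_cpu_only ./models/model-q4.gguf "Explain AVX2 in one sentence."
```

Because token generation is limited by memory bandwidth, raising the thread count beyond the physical core count rarely helps and can reduce throughput.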

Getting started on CPU

# Install Ollama — auto-detects CPU-only if no GPU is found
curl -fsSL https://ollama.ai/install.sh | sh

# Start with a small model that runs at interactive speed on CPU
ollama pull qwen3:1.7b
ollama run qwen3:1.7b

# For batch tasks (not interactive), try 7B
ollama pull qwen2.5:7b
ollama run qwen2.5:7b
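For the batch case, ollama run also accepts a prompt as an argument and exits after answering, so it can be driven from a loop. The following is a minimal sketch of that pattern: the function name `summarise_dir`, the `OLLAMA_BIN` variable, the prompt wording, and the `.summary.txt` naming are all assumptions made for illustration.

```shell
# Binary to invoke; override OLLAMA_BIN for testing or custom installs.
OLLAMA_BIN="${OLLAMA_BIN:-ollama}"

# Summarise every .txt file in a directory, writing <name>.summary.txt
# next to each input.
#   $1 = directory, $2 = model tag (defaults to qwen2.5:7b)
summarise_dir() {
  dir="$1"
  model="${2:-qwen2.5:7b}"
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue                        # no .txt files present
    case "$f" in *.summary.txt) continue ;; esac   # skip earlier outputs
    "$OLLAMA_BIN" run "$model" \
      "Summarise in three bullet points: $(cat "$f")" \
      > "${f%.txt}.summary.txt"
  done
}

# Example (hypothetical directory):
#   summarise_dir ./docs
```

At 2–4 t/s each summary may take a minute or more, so this suits overnight or background jobs. Inputs longer than the model's context window would need to be split first.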

When to consider a GPU

CPU inference is a valid entry point and a workable long-term solution for small models. Consider a discrete GPU when:

7B models feel too slow for interactive use. Even at Q4, 2–4 t/s means waiting several seconds for each sentence. A 12GB GPU runs the same model at Q8 at 20–40 t/s.
You need 13B or larger models at interactive speeds. 13B at Q4 on CPU runs at roughly 1–4 t/s. A 12GB GPU runs it at 10–20 t/s.
You run models frequently throughout the day — the accumulated time savings from a GPU are significant for daily use.

If budget is the primary constraint, a used RTX 3060 12GB is the lowest-cost way into GPU-accelerated local AI that still delivers a meaningful speedup over CPU.
