Developer stack

llama.cpp GGUF Workflow

Run quantized GGUF models locally with llama.cpp: no Docker, no cloud keys, and no Python runtime needed for inference. Build from source and run models directly on CPU or GPU hardware.

Last reviewed: June 2026

Best for: Running GGUF models without a runtime abstraction layer

Hardware: 8 GB RAM · CPU-only possible

Tools: llama.cpp · Hugging Face GGUF hub

Time to first result: 15–30 minutes

Bill of materials

Inference engine

llama.cpp

C++ inference engine for GGUF models — no Python required, runs CPU-only or with CUDA/Metal GPU acceleration

Model source

Hugging Face GGUF models

Filter by "gguf" library to find ready-to-run quantized files — look for Q4_K_M or Q5_K_M variants for the best speed/quality balance

Recommended GGUF models

Qwen3 8B Q4_K_M

Strong general model; ~5 GB file — fits on 8 GB VRAM or CPU-only with 16 GB RAM

Mistral 7B Q4_K_M

Classic GGUF baseline; widely tested and well-documented for llama.cpp

Build llama.cpp and run a GGUF model

1. Clone the repo

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

2. Configure the build

CPU-only works without any flags. Add -DGGML_CUDA=ON for NVIDIA GPUs (requires the CUDA toolkit). On macOS, Metal is enabled by default, so on Apple Silicon -DGGML_METAL=ON only makes that choice explicit.

# CPU only (any platform)
cmake -B build

# NVIDIA GPU (requires CUDA toolkit)
cmake -B build -DGGML_CUDA=ON

# Apple Silicon GPU
cmake -B build -DGGML_METAL=ON
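
If you script the configure step, the choice between the three variants above can be made automatically. The sketch below is a heuristic, not part of llama.cpp: pick_backend_flag is a hypothetical helper, and treating "nvcc is on PATH" as "the CUDA toolkit is installed" is an assumption that may not hold on every machine. It prints the cmake command instead of running it so you can review it first.

```shell
# Heuristic sketch: choose a backend flag from what is visible on this machine.
# pick_backend_flag is a hypothetical helper name, not a llama.cpp tool.
pick_backend_flag() {
  if command -v nvcc >/dev/null 2>&1; then
    echo "-DGGML_CUDA=ON"     # CUDA compiler found on PATH (assumed to mean the toolkit is installed)
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "-DGGML_METAL=ON"    # Apple Silicon
  else
    echo ""                   # CPU-only: no extra flag needed
  fi
}

# Print the configure command for review rather than executing it.
echo "cmake -B build $(pick_backend_flag)"
```

Copy the printed line into your terminal once it matches the hardware you intend to build for.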

3. Compile

cmake --build build --config Release
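
The compile step can take several minutes on a single core. A minimal sketch of a parallel build follows, assuming either nproc (Linux) or sysctl (macOS) is available. detect_jobs is a hypothetical helper, and the command is printed rather than executed.

```shell
# Sketch: find the number of logical CPUs, falling back to 1 if neither tool exists.
# detect_jobs is a hypothetical helper name.
detect_jobs() {
  nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1
}

# -j passes the parallel job count through to the underlying build tool.
echo "cmake --build build --config Release -j $(detect_jobs)"
```

If the build runs out of memory on a small machine, use a lower job count than the one detected.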

4. Download a GGUF model file

Install huggingface_hub (a Python package, so this step needs Python and pip) to download from the command line. Replace the example model with any GGUF file you prefer.

pip install huggingface_hub
huggingface-cli download Qwen/Qwen3-8B-GGUF \
  Qwen3-8B-Q4_K_M.gguf \
  --local-dir ./models

5. Run inference

Replace the path below with the actual path to your downloaded GGUF file.

./build/bin/llama-cli \
  -m ./models/Qwen3-8B-Q4_K_M.gguf \
  -p "Explain what a local LLM is in one paragraph." \
  -n 256

What is a GGUF file?

GGUF is a binary file format that packages model weights together with the metadata needed to run inference, such as architecture parameters and tokenizer data. It replaced the older GGML format in August 2023 and is the standard format for llama.cpp models. Many popular open-weight models on Hugging Face have GGUF conversions, commonly in Q4_K_M or Q5_K_M quantization, which are the usual defaults for balancing file size, memory use, and output quality.
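
To make the "binary file with a header" idea concrete: a GGUF file starts with the four ASCII bytes GGUF, followed by a 32-bit format version. The sketch below checks those two fields, which is a quick way to confirm that a download really is a model file and not an error page. gguf_version is a hypothetical helper, and it assumes a little-endian host (x86-64 or ARM64).

```shell
# Sketch: print the GGUF version of a file, or fail if the magic bytes are wrong.
# Assumes a little-endian host, because od reads the 32-bit version field
# in host byte order. gguf_version is a hypothetical helper name.
gguf_version() {
  file=$1
  magic=$(head -c 4 "$file")
  if [ "$magic" != "GGUF" ]; then
    echo "not a GGUF file: $file" >&2
    return 1
  fi
  # Bytes 4-7 hold the format version as an unsigned 32-bit integer.
  od -An -tu4 -j4 -N4 "$file" | tr -d ' \n'
  echo
}

# Usage (path taken from the download step in this guide):
# gguf_version ./models/Qwen3-8B-Q4_K_M.gguf
```

A non-zero exit status means the file does not start with the GGUF magic and llama.cpp will not be able to load it.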

Hardware requirements

Tier | Hardware | Model guidance
CPU only | 16 GB+ RAM (fast RAM helps) | 3B–8B Q4_K_M models; slower generation, works without a GPU
Mixed (partial GPU) | 8 GB VRAM + 16 GB RAM | Offload most layers to GPU; 7B–14B Q4_K_M run comfortably
Full GPU | 12–24 GB VRAM | Full layer offload for 7B–32B models at fast generation speeds
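
A rough way to place a model in one of these tiers is to compare its file size, plus some headroom, against the memory you have. The sketch below is only an approximation: fits_in_memory is a hypothetical helper, and the 2048 MB default headroom is an assumed allowance for the KV cache and runtime buffers, which in practice grow with context size.

```shell
# Sketch: does a model of MODEL_MB fit in AVAIL_MB with some headroom?
# HEADROOM_MB defaults to 2048, an assumed allowance for the KV cache and
# runtime buffers; real usage depends on context size and architecture.
fits_in_memory() {
  model_mb=$1
  avail_mb=$2
  headroom_mb=${3:-2048}
  [ $((model_mb + headroom_mb)) -le "$avail_mb" ]
}

# Example: a ~5 GB file (5120 MB) against 8 GB (8192 MB) of VRAM.
if fits_in_memory 5120 8192; then
  echo "fits: try full offload"
else
  echo "tight: offload fewer layers or pick a smaller quantization"
fi
```

Pass a larger third argument if you plan to run with a long context.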

Choosing a quantization level

Q4_K_M: Best default. A 4-bit k-quant in its medium variant: most weights are stored at about 4 bits, with a subset of tensors kept at higher precision. Good quality-to-size ratio.
Q5_K_M: Slightly better quality than Q4_K_M with a larger file size. Use when you have memory headroom.
Q8_0: Near full-precision quality. Use on high-VRAM machines when quality is the priority.
IQ2_XXS / IQ3_XS: Aggressive compression for very limited RAM. Noticeable quality loss on complex prompts.
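
The size trade-off between these levels follows from simple arithmetic: file size in GB is roughly parameters (in billions) times bits per weight, divided by 8. The sketch below applies that formula. estimate_gb is a hypothetical helper, and the result ignores metadata and per-tensor differences. Q8_0 stores 8.5 bits per weight (8-bit values plus a per-block scale), while the 4.85 figure used for Q4_K_M is an assumed approximation.

```shell
# Sketch: rough file size in GB = parameters (billions) x bits per weight / 8.
# estimate_gb is a hypothetical helper; bits-per-weight values are supplied
# by the caller and are approximations for the mixed-precision k-quants.
estimate_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

estimate_gb 8.2 8.5    # an ~8.2B-parameter model at Q8_0 (8.5 bits per weight)
estimate_gb 8.2 4.85   # the same model at Q4_K_M (assumed ~4.85 bits per weight)
```

Treat the result as a planning number and check the actual file size on the model page before downloading.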

Best for

Developers who want direct control over inference without a runtime abstraction
Machines where Docker is unavailable or undesirable
Researchers testing specific GGUF quantizations and benchmark comparisons
Builds targeting embedded devices, edge hardware, or specialized accelerators
Anyone who wants to understand what is actually running at the binary level

Useful flags

-ngl N: Number of layers to offload to the GPU. A value above the model's layer count (e.g. 99) requests every layer; lower it if the model does not fit in VRAM.
-c N: Context size in tokens. The default depends on the llama.cpp version (check llama-cli --help); raise it to 4096 or 8192 for longer conversations, at the cost of more memory.
-t N: Number of CPU threads. Match your physical core count for best CPU-only performance.
--interactive: Interactive chat mode instead of one-shot completion.
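
These flags combine on one command line. As an illustration, the sketch below assembles a llama-cli invocation from a model path and three settings and prints it for review. build_llama_cmd is a hypothetical helper, and the values in the usage line are examples, not recommendations.

```shell
# Sketch: print (not run) a llama-cli command line from a few settings.
# build_llama_cmd is a hypothetical helper name.
build_llama_cmd() {
  model=$1      # path to the .gguf file (-m)
  ngl=$2        # layers to offload to the GPU (-ngl)
  ctx=$3        # context size in tokens (-c)
  threads=$4    # CPU threads (-t)
  printf './build/bin/llama-cli -m %s -ngl %s -c %s -t %s\n' \
    "$model" "$ngl" "$ctx" "$threads"
}

# Example values: full offload, 4096-token context, 8 threads.
build_llama_cmd ./models/Qwen3-8B-Q4_K_M.gguf 99 4096 8
```

On a CPU-only build, pass 0 for the offload count so no layers are sent to a GPU.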
