Developer stack
llama.cpp GGUF Workflow
Run quantized GGUF models locally using llama.cpp, with no Docker, no cloud keys, and no Python runtime needed for inference. Build from source and run models directly on CPU or GPU hardware.
Bill of materials
Inference engine
llama.cpp is a C++ inference engine for GGUF models: no Python required, and it runs CPU-only or with CUDA/Metal GPU acceleration
Model source
On Hugging Face, filter by the "gguf" library to find ready-to-run quantized files; Q4_K_M and Q5_K_M variants are a common choice for balancing speed and quality
Recommended GGUF models
Strong general model; ~5 GB file — fits on 8 GB VRAM or CPU-only with 16 GB RAM
Classic GGUF baseline; widely tested and well-documented for llama.cpp
Build llama.cpp and run a GGUF model
1. Clone the repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
2. Configure the build
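The default configure step needs no flags and works on CPU everywhere. Add -DGGML_CUDA=ON for NVIDIA GPUs (requires the CUDA toolkit). On macOS, recent llama.cpp versions turn Metal on by default, so -DGGML_METAL=ON on Apple Silicon mainly makes that choice explicit.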
# CPU only (any platform)
cmake -B build

# NVIDIA GPU (requires CUDA toolkit)
cmake -B build -DGGML_CUDA=ON

# Apple Silicon GPU
cmake -B build -DGGML_METAL=ON
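If you script the configure step, the choice between the three commands above can be automated. The following is only a sketch: backend_flag and detect_backend are hypothetical helper names, and the detection is a heuristic (nvcc on PATH is taken as a sign that the CUDA toolkit is installed, Darwin on arm64 as a sign of Apple Silicon).

```shell
#!/bin/sh
# Map a backend name to the CMake option listed above.
# backend_flag is a hypothetical helper used only in this sketch.
backend_flag() {
  case "$1" in
    cuda)  echo "-DGGML_CUDA=ON" ;;
    metal) echo "-DGGML_METAL=ON" ;;
    cpu)   : ;;  # CPU-only needs no extra flag
    *)     echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Guess a backend from the host. This is a heuristic, not a guarantee
# that the matching toolchain is actually usable.
detect_backend() {
  if command -v nvcc >/dev/null 2>&1; then
    echo cuda
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo metal
  else
    echo cpu
  fi
}

# Usage (left as a comment so nothing is configured by accident):
#   cmake -B build $(backend_flag "$(detect_backend)")
```

The command substitution in the usage line is left unquoted on purpose, so that the empty CPU-only result adds no argument to cmake.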
3. Compile
cmake --build build --config Release
4. Download a GGUF model file
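Install the huggingface_hub Python package to download from the command line. Python is needed only for this download helper, not for inference; you can also fetch the .gguf file from the model page in a browser. Replace the example model with any GGUF file you prefer.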
pip install huggingface_hub
huggingface-cli download Qwen/Qwen3-8B-GGUF \
  Qwen3-8B-Q4_K_M.gguf \
  --local-dir ./models
5. Run inference
Replace the path below with the actual path to your downloaded GGUF file.
./build/bin/llama-cli \
  -m ./models/Qwen3-8B-Q4_K_M.gguf \
  -p "Explain what a local LLM is in one paragraph." \
  -n 256
What is a GGUF file?
GGUF is a binary file format that packages model weights alongside the metadata needed to run inference. The format replaced GGML in August 2023 and is the standard for llama.cpp quantized models. Many popular open-weight models on Hugging Face have GGUF conversions, commonly in Q4_K_M or Q5_K_M quantization, which are widely used as a balance between file size, memory use, and output quality.
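A quick way to confirm that a download really is a GGUF file is to inspect its header, which starts with the four ASCII bytes "GGUF" followed by a 32-bit format version. The sketch below assumes a little-endian host and a little-endian file (the common case); is_gguf and gguf_version are hypothetical helper names.

```shell
#!/bin/sh
# Succeeds if the file begins with the 4-byte ASCII magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Prints the format version stored in bytes 4..7 of the header.
# od decodes the integer in host byte order, so this assumes a
# little-endian machine reading a little-endian file.
gguf_version() {
  od -An -tu4 -j4 -N4 "$1" | tr -d ' \n'
}

# Usage (path is the example from the download step):
#   is_gguf ./models/Qwen3-8B-Q4_K_M.gguf && gguf_version ./models/Qwen3-8B-Q4_K_M.gguf
```

A file that fails this check is often an HTML error page or a truncated download rather than a model.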
Hardware requirements
Choosing a quantization level
- Q4_K_M: Best default. A 4-bit k-quant in the "medium" mix: most weights are stored at roughly 4 bits, while selected tensors are kept at higher precision. Good quality-to-size ratio.
- Q5_K_M: Slightly better quality than Q4_K_M with a larger file size. Use when you have VRAM headroom.
- Q8_0: 8-bit quantization with near full-precision quality. Use on high-VRAM machines when quality is the priority.
- IQ2_XXS / IQ3_XS: Aggressive compression for very limited RAM. Noticeable quality loss on complex prompts.
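To compare these levels against your available memory, file size can be roughly approximated as parameters × average bits per weight ÷ 8. The sketch below is an estimate under stated assumptions: the averages of about 4.85 bits for Q4_K_M and 5.7 bits for Q5_K_M are approximate and vary by model, 8.5 bits for Q8_0 follows from its block layout, and estimate_gb is a hypothetical helper name.

```shell
#!/bin/sh
# Rough GGUF file size in GB (decimal).
#   $1 = parameter count in billions
#   $2 = average bits per weight for the quantization (approximate)
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

# An 8-billion-parameter model at three quantization levels.
# The bits-per-weight values are assumptions, not measured figures.
echo "Q4_K_M: $(estimate_gb 8 4.85) GB"
echo "Q5_K_M: $(estimate_gb 8 5.7) GB"
echo "Q8_0:   $(estimate_gb 8 8.5) GB"
```

Runtime memory use is higher than the file size, because the context cache and compute buffers are allocated on top of the weights.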
Best for
- Developers who want direct control over inference without a runtime abstraction
- Machines where Docker is unavailable or undesirable
- Researchers testing specific GGUF quantizations and benchmark comparisons
- Builds targeting embedded devices, edge hardware, or specialized accelerators
- Anyone who wants to understand what is actually running at the binary level
Useful flags
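- -ngl N: Number of layers to offload to the GPU. A high number (e.g. 99) requests all layers; if the model does not fit in VRAM, lower N so that only part of the model is offloaded.
- -c N: Context size in tokens. The built-in default has changed between llama.cpp releases (check llama-cli --help); set 4096 or 8192 explicitly for longer conversations, keeping in mind that a larger context uses more memory.
- -t N: Number of CPU threads. Match to your physical core count for best CPU-only performance.
- --interactive: Interactive mode instead of one-shot completion. Flag names have shifted across releases (recent builds also offer -cnv / --conversation for chat-style sessions), so confirm with llama-cli --help.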
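The flags above are usually combined in a single invocation. As a minimal sketch, the helper below only prints the command so you can review it before running; llama_cmd is a hypothetical name, and the fallback values (0 layers, 4096 tokens, 4 threads) are arbitrary choices for illustration.

```shell
#!/bin/sh
# Print a llama-cli command line built from the flags listed above.
#   $1 = path to the .gguf file
#   $2 = layers to offload with -ngl (fallback 0, i.e. CPU only)
#   $3 = context size for -c (fallback 4096, arbitrary for this sketch)
#   $4 = CPU threads for -t (fallback 4, arbitrary for this sketch)
llama_cmd() {
  echo "./build/bin/llama-cli -m $1 -ngl ${2:-0} -c ${3:-4096} -t ${4:-4}"
}

# Full GPU offload, 8192-token context, 8 threads.
llama_cmd ./models/Qwen3-8B-Q4_K_M.gguf 99 8192 8
```

Printing rather than executing keeps the sketch safe to paste into a shell; once the printed line looks right, run it directly.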