Developer stack

llama.cpp GGUF Workflow

Run any quantized GGUF model locally using llama.cpp — no Docker, no cloud keys, no Python runtime required. Build from source and run models directly on CPU or GPU hardware.

Best for: Running GGUF models without a runtime abstraction layer
Hardware: 8 GB RAM · CPU-only possible
Tools: llama.cpp · Hugging Face GGUF hub
Time to first result: 15–30 minutes

Bill of materials

Inference engine

llama.cpp

C++ inference engine for GGUF models — no Python required, runs CPU-only or with CUDA/Metal GPU acceleration

Model source

Hugging Face GGUF models

Filter by "gguf" library to find ready-to-run quantized files — look for Q4_K_M or Q5_K_M variants for the best speed/quality balance

Recommended GGUF models

Qwen3 8B Q4_K_M

Strong general model; ~5 GB file — fits on 8 GB VRAM or CPU-only with 16 GB RAM

Mistral 7B Q4_K_M

Classic GGUF baseline; widely tested and well-documented for llama.cpp

Build llama.cpp and run a GGUF model

1. Clone the repo

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

2. Configure the build

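With no flags, the build targets the CPU. Add -DGGML_CUDA=ON for NVIDIA GPUs (requires the CUDA toolkit). On macOS, Metal support is enabled by default, so -DGGML_METAL=ON on Apple Silicon is optional and only makes the choice explicit.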

# CPU only (any platform)
cmake -B build

# NVIDIA GPU (requires CUDA toolkit)
cmake -B build -DGGML_CUDA=ON

# Apple Silicon GPU
cmake -B build -DGGML_METAL=ON

3. Compile

cmake --build build --config Release
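
A quick way to confirm that the compile step produced a usable binary is to check that the executable exists and ask it for its version. This is a minimal sketch: check_build is a hypothetical helper name (not part of llama.cpp), and the default path assumes the build/bin output directory used in this guide.

```shell
# Hypothetical helper: verify that a llama.cpp binary was built.
# Usage: check_build [path-to-binary]
check_build() {
  bin="${1:-./build/bin/llama-cli}"   # default output path assumed from this guide
  if [ -x "$bin" ]; then
    "$bin" --version                  # prints version and build info
  else
    echo "not found: $bin (re-check the cmake output for errors)" >&2
    return 1
  fi
}
```

Run check_build from the repository root. With multi-config generators such as Visual Studio, the binary may land in a subdirectory like build/bin/Release instead, in which case pass that path as the argument.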

4. Download a GGUF model file

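Install huggingface_hub to download from the command line. It is a Python package, used here only for the download; llama.cpp itself does not need Python, and you can instead download the file from the model page in a browser. Replace the example model with any GGUF file you prefer.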

pip install huggingface_hub
huggingface-cli download Qwen/Qwen3-8B-GGUF \
  Qwen3-8B-Q4_K_M.gguf \
  --local-dir ./models

5. Run inference

Replace the path below with the actual path to your downloaded GGUF file.

./build/bin/llama-cli \
  -m ./models/Qwen3-8B-Q4_K_M.gguf \
  -p "Explain what a local LLM is in one paragraph." \
  -n 256

What is a GGUF file?

GGUF is a binary file format that packages model weights together with the metadata needed to run inference, such as architecture parameters and the tokenizer. It replaced the older GGML format in August 2023 and is the standard format for llama.cpp models. Many popular open-weight models on Hugging Face have GGUF conversions, commonly in Q4_K_M or Q5_K_M quantization, which are widely used as a good balance between file size, memory use, and output quality.
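
Because the format is binary with a fixed header, a downloaded file can be sanity-checked before loading it. A GGUF file begins with the four ASCII bytes "GGUF", followed by a 32-bit format version. The sketch below uses two hypothetical helper names (is_gguf, gguf_version) and assumes a little-endian machine when reading the version.

```shell
# Hypothetical helpers: sanity-check a downloaded GGUF file.

# Succeeds if the first four bytes are the ASCII magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Prints the format version stored in bytes 4..7.
# od reads the value in host byte order, so this assumes a
# little-endian machine (e.g. x86-64 or Apple Silicon).
gguf_version() {
  od -An -t u4 -j 4 -N 4 "$1" | tr -d ' \n'
}
```

For example, is_gguf ./models/Qwen3-8B-Q4_K_M.gguf && echo "looks like GGUF". A failed check usually points to an incomplete download or an error page saved under a .gguf name.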

Hardware requirements

Tier | Hardware | Model guidance
CPU only | 16 GB+ RAM (fast RAM helps) | 3B–8B Q4_K_M models; slower generation, works without a GPU
Mixed (partial GPU) | 8 GB VRAM + 16 GB RAM | Offload most layers to GPU; 7B–14B Q4_K_M run comfortably
Full GPU | 12–24 GB VRAM | Full layer offload for 7B–32B models at fast generation speeds

Choosing a quantization level

  • Q4_K_M — Best default. 4-bit k-quant in its "medium" mix, which stores a subset of the more sensitive tensors at higher precision. Good quality-to-size ratio.
  • Q5_K_M — Slightly better quality than Q4_K_M with a larger file size. Use when you have VRAM headroom.
  • Q8_0 — Near full-precision quality. Use on high-VRAM machines when quality is the priority.
  • IQ2_XXS / IQ3_XS — Aggressive compression for very limited RAM. Noticeable quality loss on complex prompts.
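
To compare these levels against your hardware, file size can be approximated as parameter count times bits per weight, divided by 8. The sketch below is a rough estimate only: estimate_gguf_gb is a hypothetical helper, and the bits-per-weight figures for the K-quants are approximate assumptions (Q8_0 is 8.5 by construction: 32 8-bit weights plus one 16-bit scale per block).

```shell
# Hypothetical helper: estimate GGUF file size in GB.
# Usage: estimate_gguf_gb <params_in_billions> <bits_per_weight>
estimate_gguf_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Bits-per-weight values below are rough assumptions, not exact figures:
#   Q4_K_M ~ 4.8, Q5_K_M ~ 5.7, Q8_0 = 8.5
estimate_gguf_gb 8.2 4.8   # 8.2B-parameter model at Q4_K_M
estimate_gguf_gb 8.2 8.5   # same model at Q8_0
```

The two calls print 4.9 and 8.7. Actual memory use at run time is higher than the file size, since the context cache and runtime buffers are allocated on top of the weights.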

Best for

  • Developers who want direct control over inference without a runtime abstraction
  • Machines where Docker is unavailable or undesirable
  • Researchers testing specific GGUF quantizations and benchmark comparisons
  • Builds targeting embedded devices, edge hardware, or specialized accelerators
  • Anyone who wants to understand what is actually running at the binary level

Useful flags

  • -ngl N — Number of layers to offload to GPU. Set to a high number (e.g. 99) to offload every layer; lower it if the model does not fit in VRAM.
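  • -c N — Context size in tokens. The default varies by llama.cpp version (check llama-cli --help); 0 uses the context length stored in the model. Increase to 4096 or 8192 for longer conversations, at the cost of more memory.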
  • -t N — Number of CPU threads. Match to your physical core count for best CPU-only performance.
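  • -i, --interactive — Interactive mode instead of one-shot completion. Flag names have changed across releases (recent builds also offer -cnv for chat-template conversation mode), so confirm with llama-cli --help.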
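
The flags above can be combined into a single invocation. The following is a sketch under stated assumptions: build_cmd and the environment variable names are hypothetical, the paths are the ones used earlier in this guide, and the thread count falls back to the logical CPU count, which may exceed the physical core count on machines with SMT.

```shell
# Sketch: combine the flags above into one llama-cli invocation.
# All values are illustrative defaults; override them via the environment.
BIN="${BIN:-./build/bin/llama-cli}"
MODEL="${MODEL:-./models/Qwen3-8B-Q4_K_M.gguf}"
NGL="${NGL:-99}"        # layers to offload; lower this if VRAM runs out
CTX="${CTX:-4096}"      # context size in tokens
# Logical CPU count as a stand-in; physical cores may be fewer with SMT.
THREADS="${THREADS:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)}"

# Hypothetical helper: print the assembled command line.
build_cmd() {
  echo "$BIN -m $MODEL -ngl $NGL -c $CTX -t $THREADS -n 256"
}

build_cmd   # dry run: show the command for inspection

# Run it only when both the binary and the model are present.
if [ -x "$BIN" ] && [ -f "$MODEL" ]; then
  # Left unquoted on purpose so the string splits into arguments
  # (assumes paths without spaces).
  $(build_cmd) -p "Explain what a local LLM is in one paragraph."
fi
```

Printing the command first makes it easy to adjust one value at a time, for example NGL=20 for partial offload on a smaller GPU.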
