Guide · Local AI · Reviewed June 2026
What Is a Local LLM?
A local LLM is a large language model that runs entirely on your own hardware. The model weights live on your disk, inference happens on your CPU or GPU, and no data leaves your machine. This guide explains what that means in practice, what hardware you need, and how to get started.
Editorial review
AI tools, model releases, pricing, licenses, and platform terms can change quickly. Verify the official source before production or commercial use.
Plain-English definition
A large language model (LLM) is a neural network trained to predict and generate text. When you use ChatGPT, Claude, or Gemini, the model runs on the provider's servers and you interact with it over the internet.
A local LLM is the same kind of model, but instead of running on remote servers it runs on hardware you control: your laptop, desktop, workstation, or home server. The model weights — the billions of numerical parameters that encode what the model knows — are downloaded to your disk and loaded into your machine's memory when you run inference.
The practical consequence is that your prompts and responses never leave your machine. That is the central appeal for privacy-sensitive tasks, offline use, and cost control at scale.
How inference works locally
When you send a prompt to a local model, the following happens entirely on your hardware:
- The runtime loads the model weights from disk into RAM or GPU memory (VRAM).
- Your prompt is tokenised — converted into numerical tokens the model can process.
- The model runs a forward pass through its layers to predict the next token, then the next, until it reaches a stopping point.
- The generated tokens are decoded back into text and returned to your terminal, app, or chat interface.
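The tokenise, predict, and decode steps above can be sketched with a toy example. This is an illustration only: the five-word vocabulary and the hand-written `NEXT_TOKEN` table are hypothetical stand-ins for a real tokenizer and a real neural network, and the weight-loading step is skipped entirely.

```python
# Toy stand-in for a local runtime. A real runtime loads billions of weights;
# this "model" is a tiny hand-written lookup table (an assumption for illustration).
VOCAB = ["<eos>", "local", "models", "run", "offline"]
TOKEN_ID = {token: i for i, token in enumerate(VOCAB)}

# NEXT_TOKEN[i] is the token id the toy model predicts after token id i.
NEXT_TOKEN = {1: 2, 2: 3, 3: 4, 4: 0}


def tokenise(text):
    """Convert whitespace-separated words into numerical token ids."""
    return [TOKEN_ID[word] for word in text.split()]


def forward(token_ids):
    """Stand-in for a forward pass: predict the next id from the last id."""
    return NEXT_TOKEN[token_ids[-1]]


def generate(prompt, max_new_tokens=8):
    """Predict one token at a time until <eos> or the token limit is reached."""
    prompt_ids = tokenise(prompt)
    new_ids = []
    for _ in range(max_new_tokens):
        next_id = forward(prompt_ids + new_ids)
        if next_id == TOKEN_ID["<eos>"]:  # the stopping point
            break
        new_ids.append(next_id)
    # Decode the generated ids back into text.
    return " ".join(VOCAB[i] for i in new_ids)
```

Calling `generate("local")` walks the table one token at a time, which mirrors why output appears word by word in a chat interface.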
Speed depends on how much of the model fits into GPU memory. When the full model fits in VRAM, inference is fast. When it spills into system RAM, it slows significantly. On CPU only it is slower again, though usable for smaller models.
Key concepts you will encounter
| Concept | What it means for you |
|---|---|
| Parameters (B = billion) | A rough measure of model capacity. A 7B model has 7 billion numerical weights. Larger models are generally more capable but need more memory. |
| Quantization (Q4, Q8…) | Compressing weight precision to reduce memory use. Q4 uses roughly half the memory of Q8 with a small quality tradeoff. Most beginners should start with Q4_K_M or Q4_K_S. |
| VRAM | GPU memory. The most important hardware constraint for local LLMs. The model needs to fit here for fast inference. See the VRAM guide for sizing help. |
| Context window | How many tokens (words or word fragments) the model can read and generate in one conversation. Larger context windows use more memory. |
| GGUF format | The most common file format for quantized local models compatible with llama.cpp-based runtimes including Ollama and LM Studio. |
| Open-weight model | A model whose weights are publicly released. Most practical local LLMs are open-weight. Examples: Llama 3, Qwen3, Mistral, Gemma, Phi. |
| Tokens per second (t/s) | The speed at which the model generates output. 10–20 t/s feels comfortable for reading. Below 5 t/s can feel slow for interactive use. |
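The parameter, quantization, and t/s rows combine into simple arithmetic. The sketch below is a back-of-envelope estimate under stated assumptions: it uses nominal bits per weight (real Q4 files typically use somewhat more than 4 bits per weight on average) and counts the weights only, not the context cache or runtime overhead. The function names are made up for this example.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every weight is stored at exactly bits_per_weight bits.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def seconds_to_generate(num_tokens, tokens_per_second):
    """How long a reply of num_tokens takes at a given generation speed."""
    return num_tokens / tokens_per_second


# A 7B model at 16-bit, 8-bit, and 4-bit nominal precision.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: about {weight_memory_gb(7, bits):.1f} GB")

# A 300-token answer at 15 t/s.
print(f"300 tokens at 15 t/s: {seconds_to_generate(300, 15):.0f} s")
```

Treat the result as a lower bound when sizing hardware, since the context window and the runtime itself need memory on top of the weights.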
What hardware do you actually need?
The right hardware depends on the models you want to run. Here is a practical starting guide by hardware tier:
- CPU-only (no discrete GPU, or integrated graphics only): Run small quantized models (1B–3B parameters). Expect slow responses. Good for testing the workflow, not daily use.
- 8 GB VRAM (e.g. RTX 3070, RTX 4060): Run compact 7B-class quantized models comfortably. This is the most common starting point for practical local AI.
- 12 GB VRAM (e.g. RTX 3080 12G, RTX 4070): Run 7B and 14B-class quantized models. More room for context and multitasking.
- 16–24 GB VRAM (e.g. RTX 4080 at 16 GB, RTX 3090 or RTX 4090 at 24 GB): Run 14B–32B quantized models with large context windows. Capable enough for most local coding, RAG, and summarisation tasks.
- Apple Silicon (M-series unified memory): Unified memory architecture means RAM and GPU share the same pool. A MacBook Pro M3 Max with 48 GB or 64 GB can run surprisingly large models quickly.
- Multi-GPU or 48 GB+ VRAM: Enables 70B-class models at useful speeds. Primarily for researchers, production deployments, or dedicated AI workstations.
When in doubt, start smaller than you think you need. A 7B model running at 25 t/s is more useful than a 34B model running at 2 t/s.
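One way to apply the "start smaller" rule is a quick fit check before downloading anything. The sketch below is a minimal helper, not a sizing tool: the model sizes in `CANDIDATES` are illustrative placeholders (check the real file size on the model card), and the 2 GB default headroom is a rule of thumb rather than a measured requirement.

```python
# Illustrative sizes in GB for quantized model files. These numbers are
# placeholders, not measurements; read the real size from the model card.
CANDIDATES = {"7B Q4": 4.5, "14B Q4": 9.0, "32B Q4": 20.0}


def fits_in_vram(model_gb, vram_gb, headroom_gb=2.0):
    """True if the model plus headroom for context and overhead fits in VRAM."""
    return model_gb + headroom_gb <= vram_gb


def largest_fitting(candidates, vram_gb, headroom_gb=2.0):
    """Return the name of the largest candidate that fits, or None."""
    fitting = {
        name: size
        for name, size in candidates.items()
        if fits_in_vram(size, vram_gb, headroom_gb)
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

For example, `largest_fitting(CANDIDATES, vram_gb=8)` selects the 7B entry, because the 14B entry plus headroom exceeds 8 GB.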
Local LLM runners: which one to use
| Runner | Interface | When to use it |
|---|---|---|
| Ollama | CLI + local API | Best all-round starting point. Handles model download, runtime, and a local API endpoint. Easy to connect to Open WebUI and coding tools. |
| LM Studio | Desktop GUI + local API | Visual model browser and chat interface. Good for users who prefer clicking over terminal commands. |
| Jan | Desktop GUI + local API | Open-source desktop app with a clean interface. Built-in extensions and offline-first design. |
| llama.cpp | CLI | Low-level C++ runtime. High portability and hardware support. More configuration required than Ollama. |
| GPT4All | Desktop GUI | Beginner-friendly desktop app. Good for a quick first run on consumer hardware with no setup friction. |
Most builders should start with Ollama. It is the simplest path from zero to a running local model with an API that other tools (Open WebUI, coding assistants, scripts) can connect to immediately.
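To show what "an API that other tools can connect to" looks like from a script, here is a minimal sketch using only the Python standard library. It assumes Ollama is running on its default local port (11434), that the model has already been pulled, and that the `/api/generate` endpoint accepts `model`, `prompt`, and `stream` fields and returns a `response` field. Verify these details against the official Ollama API documentation for your installed version. The helper names `build_payload` and `ask_local_model` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama's default local address. Change it if you configured another.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    """Request body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_local_model(model, prompt, url=OLLAMA_URL, timeout=120):
    """Send one prompt to the local server and return the generated text.

    Raises urllib.error.URLError if no server is listening at the URL.
    """
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `ask_local_model("qwen3:8b", "Explain local LLMs in three sentences.")` returns the reply as a string. The request goes to localhost, so the prompt does not leave the machine.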
Which open-weight model families to try
Qwen3 (Alibaba Cloud)
Strong general-purpose and coding performance across a range of sizes (0.6B to 235B). The 8B and 14B quantized variants are practical on 8–12 GB VRAM. Thinking mode is available on select sizes.
Llama 3 (Meta)
Widely supported across runtimes and tools. Good baseline for general chat, coding, and instruction following. The 8B model is a popular starting point for builders new to local AI.
Mistral / Mixtral
Efficient models with strong instruction following. Mistral 7B and Mistral Small variants are practical on consumer hardware and well-supported by local runners.
Gemma 3 (Google)
Compact and performant. The Gemma 3 4B and 12B variants run well on consumer GPUs and show strong performance per parameter on instruction following and reasoning tasks.
Phi-4 (Microsoft)
Small but surprisingly capable. Phi-4 Mini and Phi-4 are good choices when you want a fast, low-memory model for coding help, Q&A, and simple reasoning without sacrificing too much quality.
DeepSeek-R1 (DeepSeek)
Strong reasoning and coding model family. Distilled variants (8B, 14B) are practical locally and show competitive performance on reasoning benchmarks for their size.
When local LLMs are the right choice
| Use case | Local LLM fit | Notes |
|---|---|---|
| Private coding help | Good fit | Completions, explanations, refactors, and shell commands stay on your machine. |
| Document summarisation | Good fit | Summarise private documents without uploading them to a third-party API. |
| Local RAG (retrieval-augmented generation) | Good fit with setup | Combine a local model with a vector database to chat with your own files. |
| Long complex reasoning chains | Use with caution | Frontier hosted models still outperform most local alternatives here. |
| Real-time production traffic | Check hardware carefully | Throughput depends heavily on GPU, quantization, and batching setup. |
| Image generation | Different toolchain | Stable Diffusion and similar models are separate from LLMs; different tools and hardware considerations apply. |
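The local RAG row in the table above is easier to picture with a toy retrieval step. The sketch below is deliberately simplified: it ranks documents by bag-of-words cosine similarity, whereas a real setup would normally use an embedding model and a vector database. All function names and the prompt wording are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts. A crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(
        documents, key=lambda doc: cosine(query, bag_of_words(doc)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(question, documents):
    """Put the retrieved text in front of the question for the local model."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The string from `build_prompt` is what would be sent to the local model, so both the documents and the question stay on your machine.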
Privacy and data control
Privacy is the most common reason builders choose local models. When the model runs entirely on your hardware:
- Your prompts never leave your machine.
- Responses are generated locally with no network call to a provider.
- No data is logged by a third party or used to train future models.
- You can work fully offline once the model is downloaded.
- Sensitive documents, code, credentials, and personal data stay in your control.
Privacy guarantees hold only as far as your own setup: if you connect a local model to an app that also logs to an external service, that external logging may still capture data. Audit the full data path, not just the model.
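As a starting point for that audit, the sketch below flags any configured endpoint that does not point at the loopback interface. It is a narrow check under stated assumptions: it only inspects the URLs you pass in, treats every DNS name other than `localhost` as non-local, and cannot see what an app does internally. The config keys in the usage note are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_endpoint(url):
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other DNS name: treat as non-local.
        return False


def non_local_endpoints(config):
    """Return the names of config entries whose URL leaves this machine."""
    return [name for name, url in config.items() if not is_loopback_endpoint(url)]
```

For a config such as `{"model": "http://localhost:11434", "telemetry": "https://logs.example.com/ingest"}`, the function returns `["telemetry"]`, which is the entry to investigate.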
Cost and offline use
Running a local model has no per-token API cost. After the one-time hardware investment, the marginal cost of inference is electricity and hardware wear, which is small per prompt. This matters for high-volume use cases: batch summarisation, automated code review pipelines, document processing, or any workflow that sends thousands of prompts.
Local models also work offline. Once the weights are downloaded, you can run the model on a plane, in a restricted network environment, or in a data centre with no outbound internet access.
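The cost argument can be made concrete with a break-even estimate. The figures in the usage note are hypothetical placeholders, not real prices. The sketch also assumes a flat API price per million tokens and an optional flat local running cost, such as electricity.

```python
def break_even_tokens(hardware_cost, api_price_per_million, local_cost_per_million=0.0):
    """Tokens you must process before local hardware pays for itself.

    Returns None when the API is not more expensive per token than running
    locally, because the hardware then never pays for itself.
    """
    saving_per_million = api_price_per_million - local_cost_per_million
    if saving_per_million <= 0:
        return None
    return hardware_cost / saving_per_million * 1_000_000
```

With a hypothetical 1,500 spent on hardware and a hypothetical API price of 2.00 per million tokens, `break_even_tokens(1500, 2.0)` gives 750 million tokens. That shows why the cost case is strongest for high-volume workloads.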
Tradeoffs to understand before you start
- Quality ceiling: Most local models on consumer hardware are weaker than frontier hosted models on complex reasoning, very long context, and nuanced instruction following. Test your actual tasks.
- Setup overhead: Installing a runner, downloading model weights, configuring context and quantization, and debugging memory errors takes more effort than signing up for an API.
- Hardware dependency: Performance is tied to your hardware. Upgrading capability means upgrading hardware, not just switching a model ID.
- Model updates: You manage updates manually. Cloud providers update their models silently; with local models, pulling a new version is your responsibility.
- Memory pressure: Running a large local model alongside a browser, IDE, and other apps can cause memory contention. A dedicated inference machine removes this friction.
Getting started: the right order
- Check your VRAM (GPU memory) and pick a model size that fits comfortably, leaving 2 GB of headroom.
- Install Ollama from the official site and verify it runs with `ollama --version`.
- Pull one small model: `ollama pull qwen3:8b` or `ollama pull llama3:8b`.
- Run a quick test prompt: `ollama run qwen3:8b "Explain local LLMs in three sentences."`
- Log the response speed, quality, and memory use before adding any other tools.
- Only after the base model works reliably: add Open WebUI, a coding assistant, or a RAG layer.
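For the logging step, a small helper can turn raw timing numbers into a tokens-per-second figure. The sketch assumes you have a count of generated tokens and a generation time in nanoseconds. Ollama's API responses are assumed here to report these as `eval_count` and `eval_duration`, so confirm the field names in the official documentation. The 10 t/s and 5 t/s thresholds are taken from the concepts table above, while the "usable" label for the range in between is this example's own wording.

```python
def tokens_per_second(eval_count, eval_duration_ns):
    """Generation speed from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def speed_label(tps):
    """Rough interactive-use label based on the thresholds in this guide."""
    if tps >= 10:
        return "comfortable"
    if tps < 5:
        return "slow"
    return "usable"


def log_line(model, eval_count, eval_duration_ns):
    """One line suitable for appending to a plain-text benchmark log."""
    tps = tokens_per_second(eval_count, eval_duration_ns)
    return f"{model}: {tps:.1f} t/s ({speed_label(tps)})"
```

Record one line per model and quantization you try. It gives you a baseline to compare against once you add other tools on top.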
FAQ
Do I need a GPU to run a local LLM?
No. Smaller quantized models can run on CPU only, but an NVIDIA GPU (or Apple silicon with unified memory) makes larger models faster and more practical. CPU-only setups work best for lightweight experimentation, not sustained daily use.
What is the difference between a local LLM and a cloud LLM?
A local LLM runs entirely on your own hardware: the model weights load into your RAM or GPU memory and inference happens on your machine. A cloud LLM is hosted on remote servers; you send prompts over the internet and receive responses, but the model and compute belong to the provider.
How much storage does a local LLM need?
Storage requirements vary by model size and quantization. A compact 7B-parameter model quantized to 4-bit may need 4–6 GB of disk space. Larger models (14B, 32B, 70B) need proportionally more. Check the model card for the specific checkpoint you plan to use.
What does quantization mean for local LLMs?
Quantization reduces the numerical precision of the model weights (for example, from 16-bit floats to 4-bit integers) to lower memory use and improve speed. Q4 quantized models run on far less VRAM than full-precision versions, with a modest quality tradeoff that varies by task and model family.
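The precision reduction described above can be illustrated with a toy round trip. This sketch uses simple symmetric absmax quantization over a whole list. It is not the block-wise scheme that GGUF quantization formats use. Its only purpose is to show where the memory saving and the small error come from.

```python
def quantize_4bit(weights):
    """Map floats to signed integers in [-7, 7] plus one shared scale factor.

    Toy symmetric absmax scheme (an assumption for illustration).
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
quantized, scale = quantize_4bit(weights)
restored = dequantize(quantized, scale)

# Each restored value is within half a quantization step of the original.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(quantized, [round(e, 3) for e in errors])
```

Each integer needs only 4 bits instead of 16, and the rounding error on each weight is the quality tradeoff the answer above refers to.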
Are local LLMs as good as ChatGPT or Claude?
State-of-the-art local open-weight models have closed much of the gap for everyday tasks, but the very best frontier hosted models still outperform most local alternatives on complex reasoning, instruction following, and coding. The right comparison is: which local model is good enough for your specific task on your specific hardware?
What is the easiest way to start running a local LLM?
Install Ollama, then run one small model from the terminal. Ollama handles model download, runtime, and a local API in a single tool. LM Studio is a good alternative if you prefer a desktop GUI. Start with a model smaller than your hardware clearly supports, verify it works, then test larger models.
Next step: check your hardware before downloading
Use the local LLM compatibility checker to match model sizes to your GPU memory before spending time on a model that won't run well.