llama.cpp

Your Ultimate Guide to Running LLMs Locally

Complete documentation, installation guides, optimization tips, and a comparison of popular forks for llama.cpp, a highly efficient C/C++ implementation of LLaMA and other large language models.

What is llama.cpp?

A high-performance C/C++ port of Meta's LLaMA models, enabling efficient inference on consumer hardware.

Zero Dependencies

Pure C/C++ implementation with no external dependencies. Just compile and run on CPU or GPU.

Optimized Inference

Highly optimized for 4-bit and 5-bit quantized models. Supports GPU acceleration via CUDA, Metal, and Vulkan.

Multiple Models

Supports LLaMA, LLaMA 2, Falcon, WizardLM, Vicuna, and many other GGUF models.

  • 60k+ GitHub stars
  • Q4_K_M recommended quant
  • CUDA GPU support
  • 8GB+ RAM required

Installation Guide

Multiple ways to get llama.cpp running on your machine.

Option 1: Pre-built Releases

# Download from:
https://github.com/ggerganov/llama.cpp/releases

# Look for: llama-b[BUILD]-bin-win-[ARCH]-[BUILD_TYPE].zip
# llama-b[BUILD]-macOS-[ARCH].zip
# llama-b[BUILD]-bin-ubuntu-[ARCH].[EXT]

Option 2: Build from Source

Mac/Linux

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

Windows (CMake)

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake .
cmake --build . --config Release

With CUDA Support (GPU)

make LLAMA_CUDA=1   # older builds used LLAMA_CUBLAS; newer builds use GGML_CUDA
# or for Windows
cmake -DLLAMA_CUDA=ON .
cmake --build . --config Release

Option 3: Docker

docker pull ghcr.io/ggerganov/llama.cpp:server
docker run -v /path/to/models:/models -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server -m /models/your-model.gguf --host 0.0.0.0 --port 8080

Usage & Commands

Master the CLI and server modes. Note: recent llama.cpp builds rename the binaries to llama-cli and llama-server; the commands below use the older names.

Basic Commands

Simple Inference

./main -m models/model.gguf -p "Your prompt here"

Interactive Chat

./main -m models/model.gguf -i

Server Mode (API)

./server -m models/model.gguf -c 2048 --port 8080

Important Flags

-c, --ctx-size        Context size in tokens (e.g., 2048, 4096, 8192)
-n, --n-predict       Number of tokens to predict (-1 = infinity)
-t, --threads         Number of CPU threads (recommended: physical core count)
--temp                Sampling temperature (default 0.8)
-ngl, --n-gpu-layers  Number of layers to offload to the GPU
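When scripting llama.cpp, it helps to assemble these flags programmatically. A minimal sketch (the `build_main_args` helper and its defaults are illustrative, not part of llama.cpp):

```python
# Build an argv list for ./main from the common flags described above.
def build_main_args(model, prompt=None, ctx_size=2048, n_predict=-1,
                    threads=None, temp=0.8, gpu_layers=0):
    args = ["./main", "-m", model, "-c", str(ctx_size), "-n", str(n_predict)]
    if threads is not None:
        args += ["-t", str(threads)]
    args += ["--temp", str(temp)]
    if gpu_layers:
        args += ["-ngl", str(gpu_layers)]
    if prompt is not None:
        args += ["-p", prompt]
    return args

cmd = build_main_args("models/model.gguf", prompt="Hello", threads=8, gpu_layers=33)
print(" ".join(cmd))
```

Pass the resulting list to `subprocess.run(cmd)` rather than joining it into a shell string, so prompts with spaces or quotes are handled safely.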

API Example (OpenAI-compatible)

# Start server
./server -m models/llama-2-7b-chat.Q4_K_M.gguf -c 4096

# Send request
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'
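The same request can be sent from Python with only the standard library. A sketch assuming a server is already listening on localhost:8080 (the `chat` helper is illustrative):

```python
# Call the llama.cpp server's OpenAI-compatible chat endpoint, stdlib only.
import json
import urllib.request

def chat(messages, url="http://localhost:8080/v1/chat/completions"):
    payload = json.dumps({"messages": messages}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# reply = chat(messages)                                  # needs a running server
# print(reply["choices"][0]["message"]["content"])
```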

Popular Forks & Projects

Specialized versions and wrappers built on llama.cpp

Ollama

Top Pick

The easiest way to run LLaMA, Mistral, and other models locally. Provides a CLI and API for running models.

  • macOS, Linux, Windows
  • Model library available
  • REST API included
  • One-liner install
ollama.ai

TextGen WebUI

GUI

A web interface for LLMs. Supports multiple backends including llama.cpp with extensive character/persona features.

  • Web-based interface
  • Chat/Completions modes
  • LoRA support
  • Extensions support
GitHub

KoboldCpp

Gaming

A user-friendly wrapper for llama.cpp optimized for story writing and text adventure gaming.

  • Kobold AI compatibility
  • Streamlined UI
  • Adventure mode
  • Easy single-binary setup
GitHub

llamafile

Portable

Mozilla's project. LLMs packaged as single executable files that run on most computers without dependencies.

  • Single-file executables
  • No installation needed
  • Cross-platform
  • Embeddable
GitHub

LocalAI

API

Drop-in OpenAI API replacement. Self-hosted with llama.cpp backend. Supports text generation, images, and audio.

  • OpenAI API compatible
  • Docker ready
  • Model hot-reloading
  • Multiple backends
localai.io

LM Studio

App

Desktop application for running local LLMs with a beautiful interface. Easy model downloading and chatting.

  • Desktop GUI
  • Chat interface
  • HuggingFace integration
  • macOS/Windows
lmstudio.ai

Optimization Tips

Get the best performance from your models.

Quantization Guide (GGUF)

Q4_K_M Best balance

Recommended for most models. ~4.7GB for 7B model. Quality slightly better than Q4_0.

Q5_K_M High quality

For quality-critical tasks. ~5.8GB for 7B model.

Q8_0 Maximum quality

Almost unnoticeable loss. ~7GB for 7B model.
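These file sizes follow from simple bits-per-weight arithmetic. A rough estimator (the effective bits-per-weight figures are approximations, since K-quants mix block types):

```python
# Rough GGUF file-size estimate from parameter count and quantization type.
# Effective bits-per-weight values are approximate.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billion, quant):
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9  # weights -> bytes -> GB

for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"7B @ {q}: ~{gguf_size_gb(7, q):.1f} GB")
```

Real files run slightly larger than this estimate because of metadata and non-quantized tensors (embeddings, norms).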

Performance Tips

  • GPU Offloading: Use -ngl 35 on macOS (Metal) or -ngl 33 on NVIDIA (CUDA) for 7B models
  • Context Size: Start with 2048, increase based on your needs (uses more VRAM)
  • Threads: Set to your physical CPU cores count with -t [cores]
  • Memory Locking: Use --mlock to keep the model resident in RAM and prevent swapping (memory mapping itself is on by default; disable it with --no-mmap)
  • Batch Size: Increase -b 512 for higher throughput
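To pick a starting -ngl value, you can estimate how many layers fit in VRAM. A deliberately simplified sketch (it ignores the KV cache and the output layer, so treat the result as an upper bound to tune down from):

```python
# Estimate how many layers fit in VRAM, to choose a starting -ngl value.
# Per-layer size = model file size / layer count -- a simplification that
# ignores the KV cache and the output layer.
def suggest_ngl(model_size_gb, n_layers, vram_gb, headroom_gb=1.0):
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a 7B Q4_K_M model (~4.1 GB, 32 layers) on an 8 GB card
print(suggest_ngl(4.1, 32, 8.0))
```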

Hardware Requirements

  • 7B model: ~4-8GB RAM (an M1 Mac is sufficient)
  • 13B model: ~8-12GB RAM (GPU recommended)
  • 30B model: ~20GB RAM (32GB system RAM + GPU)
  • 70B model: ~40GB+ RAM (high-end GPU)

Converting Models

Converting Safetensors/PyTorch models to GGUF format for llama.cpp:

# Install dependencies (the repo ships a requirements file)
python -m pip install -r requirements.txt

# Convert HuggingFace model to GGUF (f16)
python convert-hf-to-gguf.py /path/to/model \
  --outfile /path/to/output/model-f16.gguf \
  --outtype f16

# Quantize with the bundled tool
./quantize /path/to/output/model-f16.gguf \
  /path/to/output/model.Q4_K_M.gguf Q4_K_M

convert-hf-to-gguf.py outtypes: f32, f16, bf16, q8_0. K-quants such as Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, and Q6_K are produced afterwards with ./quantize.
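To sanity-check a converted file, you can read its header directly: a GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian uint32 version. A stdlib-only sketch:

```python
# Check a file's GGUF magic and version number.
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Return the GGUF version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# Demo with a synthetic header; a real .gguf model file works the same way.
path = os.path.join(tempfile.gettempdir(), "fake.gguf")
with open(path, "wb") as f:
    f.write(GGUF_MAGIC + struct.pack("<I", 3))
print(read_gguf_header(path))  # -> 3
```

A failure here is the same condition behind the "invalid magic" error: the file is not GGUF at all (often a legacy GGML file or a truncated download).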

Where to Download Models

Pre-converted GGUF models ready to use.

TheBloke

The most popular source for quantized GGUF models. Hundreds of models including Llama 2, Mistral, CodeLlama, and more.

View on HuggingFace

NousResearch

Research-focused models including Hermes, Synthia, and other fine-tuned versions with GGUF support.

View on HuggingFace

LWDW (RunPod)

Specialized in large model variants (70B+) and unique quantizations. Great for GPU cloud inference.

View on HuggingFace

Popular Models to Try

  • mistral-7b-instruct: fast, great quality
  • llama-2-7b/13b-chat: all-purpose, balanced
  • codellama-7b/13b: code generation
  • neural-chat-7b: conversations

Advanced Features

Unlock the full potential of llama.cpp.

Speculative Decoding

Use a smaller draft model to speed up token generation. Can achieve 2-3x speedup on supported hardware.

./speculative -m large_model.gguf -md draft_model.gguf -ngl 35 --draft 16
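The draft-and-verify idea can be illustrated with a toy sketch (both "models" below are deterministic stand-ins, not llama.cpp APIs):

```python
# Toy speculative decoding step: the cheap draft model proposes K tokens,
# the expensive target model verifies them and keeps the agreeing prefix.
def draft_model(prefix, k):
    # cheap model: returns a canned guess for the next k tokens
    canned = ["the", "quick", "brown", "cat", "jumps"]
    return canned[:k]

def target_model_next(prefix):
    # expensive model: the ground truth we want to reproduce
    truth = ["the", "quick", "brown", "fox", "jumps"]
    return truth[len(prefix)] if len(prefix) < len(truth) else None

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    accepted = []
    for tok in proposed:
        expected = target_model_next(prefix + accepted)
        if tok != expected:
            # first mismatch: take the target's own token and stop
            if expected is not None:
                accepted.append(expected)
            break
        accepted.append(tok)
    return accepted

print(speculative_step([]))  # -> ['the', 'quick', 'brown', 'fox']
```

One verification pass accepts three drafted tokens and corrects the fourth, which is where the speedup comes from: several tokens per target-model call instead of one.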

Grammar-Based Sampling

Force JSON output or specific formats using GBNF grammar files. Perfect for structured output.

./main -m model.gguf --grammar-file json.gbnf -p "Generate JSON:"

Continuous Batching

Process multiple prompts simultaneously in server mode for higher throughput in production environments.

LoRA Support

Load LoRA adapters on top of base models without merging. Hot-swap adapters at runtime.

./main -m base.gguf --lora adapter.gguf
# or with an explicit scale:
./main -m base.gguf --lora-scaled adapter.gguf 0.8

Troubleshooting

Common issues and solutions.

CUDA Out of Memory

Error: CUDA out of memory

Reduce GPU layers or use a smaller model. Try -ngl 20 instead of -ngl 35, or use a Q4_K_M quantized model instead of Q5.

Slow Token Generation

Model is running on CPU instead of GPU.

Ensure you built with CUDA/Metal support. Check nvidia-smi or Activity Monitor to verify GPU usage. Increase -ngl to offload more layers.

GGUF Format Errors

Error: invalid magic

Your llama.cpp build may be too old for this GGUF version, or the file may be in the legacy (pre-GGUF) GGML format. Pull the latest changes and rebuild, or download a GGUF conversion that matches your build.

Model Output is Gibberish

Random characters or nonsensical output.

Usually indicates the wrong prompt template or an incompatible/corrupted model file. Make sure you're using the model's own chat template (e.g., [INST] tags for Mistral Instruct, ChatML for models trained on it).
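A sketch of two widely used templates (the strings follow the models' published chat templates, but always verify against your model card):

```python
# Two common prompt templates; sending the wrong one to a model is a
# frequent cause of gibberish output.
def mistral_instruct(user_msg):
    # Mistral Instruct wraps the user turn in [INST] ... [/INST]
    return f"<s>[INST] {user_msg} [/INST]"

def chatml(system_msg, user_msg):
    # ChatML delimits each turn with <|im_start|> / <|im_end|> markers
    return (f"<|im_start|>system\n{system_msg}<|im_end|>\n"
            f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

print(mistral_instruct("Hello!"))
```

Newer llama.cpp builds can also apply the template embedded in the GGUF metadata for you, which avoids hand-formatting entirely.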

Community & Resources

Get help and stay updated.