llama.cpp

Your Ultimate Guide to Running LLMs Locally

Complete documentation, installation guides, optimization tips, and a comparison of popular forks for llama.cpp, a highly efficient C/C++ implementation of LLaMA and other large language models.

What is llama.cpp?

A high-performance C/C++ port of Meta's LLaMA models, enabling efficient inference on consumer hardware.

Zero Dependencies

Pure C/C++ implementation with no external dependencies. Just compile and run on CPU or GPU.

Optimized Inference

Highly optimized for 4-bit and 5-bit quantized models. Supports GPU acceleration via CUDA, Metal, and Vulkan.

Multiple Models

Supports LLaMA, LLaMA 2, Falcon, WizardLM, Vicuna, and many other GGUF models.

  • 60k+ GitHub stars
  • Q4_K_M recommended quant
  • CUDA GPU support
  • 8GB+ RAM required

Installation Guide

Multiple ways to get llama.cpp running on your machine.

Option 1: Pre-built Releases

# Download from:
https://github.com/ggerganov/llama.cpp/releases

# Look for: llama-b[BUILD]-bin-win-[ARCH]-[BUILD_TYPE].zip
# llama-b[BUILD]-macOS-[ARCH].zip
# llama-b[BUILD]-bin-ubuntu-[ARCH].[EXT]

Option 2: Build from Source

Mac/Linux

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

Windows (CMake)

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake .
cmake --build . --config Release

With CUDA Support (GPU)

make LLAMA_CUDA=1   # older builds used LLAMA_CUBLAS; newer builds use GGML_CUDA
# or for Windows
cmake -DLLAMA_CUDA=ON .
cmake --build . --config Release

Option 3: Docker

docker pull ghcr.io/ggerganov/llama.cpp:server
docker run -v /path/to/models:/models -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server -m /models/your-model.gguf --host 0.0.0.0 --port 8080

Usage & Commands

Master the CLI and server modes. Note: recent llama.cpp builds rename the binaries to llama-cli and llama-server; the commands below use the older names.

Basic Commands

Simple Inference

./main -m models/model.gguf -p "Your prompt here"

Interactive Chat

./main -m models/model.gguf -i

Server Mode (API)

./server -m models/model.gguf -c 2048 --port 8080

Important Flags

-c, --ctx-size        Context size in tokens (e.g., 2048, 4096, 8192)
-n, --n-predict       Number of tokens to predict (-1 = infinity)
-t, --threads         Number of CPU threads (recommended: physical core count)
--temp                Sampling temperature (default 0.8)
-ngl, --n-gpu-layers  Number of layers to offload to the GPU
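When scripting llama.cpp, it helps to assemble these flags programmatically. A minimal sketch (the `build_main_args` helper and its defaults are illustrative, not part of llama.cpp):

```python
# Build an argv list for ./main from the common flags described above.
def build_main_args(model, prompt=None, ctx_size=2048, n_predict=-1,
                    threads=None, temp=0.8, gpu_layers=0):
    args = ["./main", "-m", model, "-c", str(ctx_size), "-n", str(n_predict)]
    if threads is not None:
        args += ["-t", str(threads)]
    args += ["--temp", str(temp)]
    if gpu_layers:
        args += ["-ngl", str(gpu_layers)]
    if prompt is not None:
        args += ["-p", prompt]
    return args

cmd = build_main_args("models/model.gguf", prompt="Hello", threads=8, gpu_layers=33)
print(" ".join(cmd))
```

Pass the resulting list to `subprocess.run(cmd)` rather than joining it into a shell string, so prompts with spaces or quotes are handled safely.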

API Example (OpenAI-compatible)

# Start server
./server -m models/llama-2-7b-chat.Q4_K_M.gguf -c 4096

# Send request
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'
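The same request can be sent from Python with only the standard library. A sketch assuming a server is already listening on localhost:8080 (the `chat` helper is illustrative):

```python
# Call the llama.cpp server's OpenAI-compatible chat endpoint, stdlib only.
import json
import urllib.request

def chat(messages, url="http://localhost:8080/v1/chat/completions"):
    payload = json.dumps({"messages": messages}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# reply = chat(messages)                                  # needs a running server
# print(reply["choices"][0]["message"]["content"])
```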

Popular Forks & Projects

Specialized versions and wrappers built on llama.cpp

Ollama

Top Pick

The easiest way to run LLaMA, Mistral, and other models locally. Provides a CLI and API for running models.

  • macOS, Linux, Windows
  • Model library available
  • REST API included
  • One-liner install
ollama.ai

TextGen WebUI

GUI

A web interface for LLMs. Supports multiple backends including llama.cpp with extensive character/persona features.

  • Web-based interface
  • Chat/Completions modes
  • LoRA support
  • Extensions support
GitHub

KoboldCpp

Gaming

A user-friendly wrapper for llama.cpp optimized for story writing and text adventure gaming.

  • Kobold AI compatibility
  • Streamlined UI
  • Adventure mode
  • Easy single-binary setup
GitHub

llamafile

Portable

Mozilla's project. LLMs packaged as single executable files that run on most computers without dependencies.

  • Single-file executables
  • No installation needed
  • Cross-platform
  • Embeddable
GitHub

LocalAI

API

Drop-in OpenAI API replacement. Self-hosted with llama.cpp backend. Supports text generation, images, and audio.

  • OpenAI API compatible
  • Docker ready
  • Model hot-reloading
  • Multiple backends
localai.io

LM Studio

App

Desktop application for running local LLMs with a beautiful interface. Easy model downloading and chatting.

  • Desktop GUI
  • Chat interface
  • HuggingFace integration
  • macOS/Windows
lmstudio.ai

Optimization Tips

Get the best performance from your models.

Quantization Guide (GGUF)

Q4_K_M Best balance

Recommended for most models. ~4.7GB for 7B model. Quality slightly better than Q4_0.

Q5_K_M High quality

For quality-critical tasks. ~5.8GB for 7B model.

Q8_0 Maximum quality

Almost unnoticeable loss. ~7GB for 7B model.
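These file sizes follow from simple bits-per-weight arithmetic. A rough estimator (the effective bits-per-weight figures are approximations, since K-quants mix block types):

```python
# Rough GGUF file-size estimate from parameter count and quantization type.
# Effective bits-per-weight values are approximate.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billion, quant):
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9  # weights -> bytes -> GB

for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"7B @ {q}: ~{gguf_size_gb(7, q):.1f} GB")
```

Real files run slightly larger than this estimate because of metadata and non-quantized tensors (embeddings, norms).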

Performance Tips

  • GPU Offloading: Use -ngl 35 on macOS (Metal) or -ngl 33 on NVIDIA (CUDA) for 7B models
  • Context Size: Start with 2048, increase based on your needs (uses more VRAM)
  • Threads: Set to your physical CPU cores count with -t [cores]
  • Memory Locking: Use --mlock to keep the model resident in RAM and prevent swapping (memory mapping itself is on by default; disable it with --no-mmap)
  • Batch Size: Increase -b 512 for higher throughput
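To pick a starting -ngl value, you can estimate how many layers fit in VRAM. A deliberately simplified sketch (it ignores the KV cache and the output layer, so treat the result as an upper bound to tune down from):

```python
# Estimate how many layers fit in VRAM, to choose a starting -ngl value.
# Per-layer size = model file size / layer count -- a simplification that
# ignores the KV cache and the output layer.
def suggest_ngl(model_size_gb, n_layers, vram_gb, headroom_gb=1.0):
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a 7B Q4_K_M model (~4.1 GB, 32 layers) on an 8 GB card
print(suggest_ngl(4.1, 32, 8.0))
```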

Hardware Requirements

  • 7B model: ~4-8GB RAM (an M1 Mac is sufficient)
  • 13B model: ~8-12GB RAM (GPU recommended)
  • 30B model: ~20GB RAM (32GB system RAM + GPU)
  • 70B model: ~40GB+ RAM (high-end GPU)

Converting Models

Converting Safetensors/PyTorch models to GGUF format for llama.cpp:

# Install dependencies (the repo ships a requirements file)
python -m pip install -r requirements.txt

# Convert HuggingFace model to GGUF (f16)
python convert-hf-to-gguf.py /path/to/model \
  --outfile /path/to/output/model-f16.gguf \
  --outtype f16

# Quantize with the bundled tool
./quantize /path/to/output/model-f16.gguf \
  /path/to/output/model.Q4_K_M.gguf Q4_K_M

convert-hf-to-gguf.py outtypes: f32, f16, bf16, q8_0. K-quants such as Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, and Q6_K are produced afterwards with ./quantize.
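To sanity-check a converted file, you can read its header directly: a GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian uint32 version. A stdlib-only sketch:

```python
# Check a file's GGUF magic and version number.
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Return the GGUF version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# Demo with a synthetic header; a real .gguf model file works the same way.
path = os.path.join(tempfile.gettempdir(), "fake.gguf")
with open(path, "wb") as f:
    f.write(GGUF_MAGIC + struct.pack("<I", 3))
print(read_gguf_header(path))  # -> 3
```

A failure here is the same condition behind the "invalid magic" error: the file is not GGUF at all (often a legacy GGML file or a truncated download).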

Where to Download Models

Pre-converted GGUF models ready to use.

TheBloke

The most popular source for quantized GGUF models. Hundreds of models including Llama 2, Mistral, CodeLlama, and more.

View on HuggingFace

NousResearch

Research-focused models including Hermes, Synthia, and other fine-tuned versions with GGUF support.

View on HuggingFace

LWDW (RunPod)

Specialized in large model variants (70B+) and unique quantizations. Great for GPU cloud inference.

View on HuggingFace

Popular Models to Try

  • mistral-7b-instruct: fast, great quality
  • llama-2-7b/13b-chat: all-purpose, balanced
  • codellama-7b/13b: code generation
  • neural-chat-7b: conversations

Advanced Features

Unlock the full potential of llama.cpp.

Speculative Decoding

Use a smaller draft model to speed up token generation. Can achieve 2-3x speedup on supported hardware.

./speculative -m large_model.gguf -md draft_model.gguf -ngl 35 --draft 16
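The draft-and-verify idea can be illustrated with a toy sketch (both "models" below are deterministic stand-ins, not llama.cpp APIs):

```python
# Toy speculative decoding step: the cheap draft model proposes K tokens,
# the expensive target model verifies them and keeps the agreeing prefix.
def draft_model(prefix, k):
    # cheap model: returns a canned guess for the next k tokens
    canned = ["the", "quick", "brown", "cat", "jumps"]
    return canned[:k]

def target_model_next(prefix):
    # expensive model: the ground truth we want to reproduce
    truth = ["the", "quick", "brown", "fox", "jumps"]
    return truth[len(prefix)] if len(prefix) < len(truth) else None

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    accepted = []
    for tok in proposed:
        expected = target_model_next(prefix + accepted)
        if tok != expected:
            # first mismatch: take the target's own token and stop
            if expected is not None:
                accepted.append(expected)
            break
        accepted.append(tok)
    return accepted

print(speculative_step([]))  # -> ['the', 'quick', 'brown', 'fox']
```

One verification pass accepts three drafted tokens and corrects the fourth, which is where the speedup comes from: several tokens per target-model call instead of one.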

Grammar-Based Sampling

Force JSON output or specific formats using GBNF grammar files. Perfect for structured output.

./main -m model.gguf --grammar-file json.gbnf -p "Generate JSON:"

Continuous Batching

Process multiple prompts simultaneously in server mode for higher throughput in production environments.

LoRA Support

Load LoRA adapters on top of base models without merging. Hot-swap adapters at runtime.

./main -m base.gguf --lora adapter.gguf
# or with an explicit scale:
./main -m base.gguf --lora-scaled adapter.gguf 0.8

Troubleshooting

Common issues and solutions.

CUDA Out of Memory

Error: CUDA out of memory

Reduce GPU layers or use a smaller model. Try -ngl 20 instead of -ngl 35, or use a Q4_K_M quantized model instead of Q5.

Slow Token Generation

Model is running on CPU instead of GPU.

Ensure you built with CUDA/Metal support. Check nvidia-smi or Activity Monitor to verify GPU usage. Increase -ngl to offload more layers.

GGUF Format Errors

Error: invalid magic

Your llama.cpp build may be too old for this GGUF version, or the file may be in the legacy (pre-GGUF) GGML format. Pull the latest changes and rebuild, or download a GGUF conversion that matches your build.

Model Output is Gibberish

Random characters or nonsensical output.

Usually indicates the wrong prompt template or an incompatible/corrupted model file. Make sure you're using the model's own chat template (e.g., [INST] tags for Mistral Instruct, ChatML for models trained on it).
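A sketch of two widely used templates (the strings follow the models' published chat templates, but always verify against your model card):

```python
# Two common prompt templates; sending the wrong one to a model is a
# frequent cause of gibberish output.
def mistral_instruct(user_msg):
    # Mistral Instruct wraps the user turn in [INST] ... [/INST]
    return f"<s>[INST] {user_msg} [/INST]"

def chatml(system_msg, user_msg):
    # ChatML delimits each turn with <|im_start|> / <|im_end|> markers
    return (f"<|im_start|>system\n{system_msg}<|im_end|>\n"
            f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

print(mistral_instruct("Hello!"))
```

Newer llama.cpp builds can also apply the template embedded in the GGUF metadata for you, which avoids hand-formatting entirely.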

Community & Resources

Get help and stay updated.