Your Ultimate Guide to Running LLMs Locally
Complete documentation, installation guides, optimization tips, and forks comparison for the most efficient C++ implementation of LLaMA and other large language models.
A high-performance C++ port of Meta's LLaMA model, enabling efficient inference on consumer hardware.
Pure C++ implementation with no dependencies. Just compile and run on CPU or GPU.
Highly optimized for 4-bit and 5-bit quantized models. Supports GPU acceleration via CUDA, Metal, and Vulkan.
Supports LLaMA, LLaMA 2, Falcon, Wizard, Vicuna, and many more GGUF models.
Multiple ways to get llama.cpp running on your machine.
# Download from:
https://github.com/ggerganov/llama.cpp/releases
# Look for: llama-b[BUILD]-bin-win-[ARCH]-[BUILD_TYPE].zip
# llama-b[BUILD]-macOS-[ARCH].zip
# llama-b[BUILD]-bin-ubuntu-[ARCH].[EXT]
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake .
cmake --build . --config Release
make LLAMA_CUDA=1

# or, for Windows:
cmake -DLLAMA_CUDA=ON .
cmake --build . --config Release
docker pull ghcr.io/ggerganov/llama.cpp:latest
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:latest --api -m /models/your-model.gguf
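The server can also run in a container with its port published. A sketch, assuming a server-variant image tag (verify the exact tag and flags against the repo's Docker documentation):

```shell
# Run the HTTP server in Docker and publish its port
# (the ":server" image tag is an assumption; check the repo's docs)
docker run -p 8080:8080 -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/your-model.gguf -c 2048 --port 8080 --host 0.0.0.0
```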
Master the CLI and server modes.
./main -m models/model.gguf -p "Your prompt here"
./main -m models/model.gguf --interactive
./server -m models/model.gguf -c 2048 --port 8080
-c, --ctx-size
Context size (e.g., 2048, 4096, 8192)
-n, --n-predict
Number of tokens to predict (-1 = infinity)
-t, --threads
Number of CPU threads (recommend: physical cores)
--temp
Temperature (0.8 is standard)
-ngl, --n-gpu-layers
Number of layers to offload to GPU
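Put together, a typical invocation combining these flags might look like this (the model path and values are illustrative):

```shell
# -c: context size, -n: tokens to predict, -t: CPU threads (match physical cores),
# --temp: sampling temperature, -ngl: layers offloaded to GPU
./main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 4096 -n 256 -t 8 --temp 0.8 -ngl 35 \
  -p "Your prompt here"
```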
# Start server
./server -m models/llama-2-7b-chat.Q4_K_M.gguf -c 4096
# Send request
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
}'
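The OpenAI-compatible endpoint also supports token streaming. A sketch against the same server as above (`"stream": true` follows the OpenAI API convention):

```shell
# Request a streamed response; tokens arrive incrementally as server-sent "data:" lines
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```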
Specialized versions and wrappers built on llama.cpp
Ollama — The easiest way to run LLaMA, Mistral, and other models locally. Provides a CLI and API for running models.
text-generation-webui — A web interface for LLMs. Supports multiple backends including llama.cpp with extensive character/persona features.
KoboldCpp — A user-friendly wrapper for llama.cpp optimized for story writing and text adventure gaming.
llamafile — Mozilla's project. LLMs packaged as single executable files that run on most computers without dependencies.
LocalAI — Drop-in OpenAI API replacement. Self-hosted with llama.cpp backend. Supports text generation, images, and audio.
LM Studio — Desktop application for running local LLMs with a beautiful interface. Easy model downloading and chatting.
Get the best performance from your models.
Q4_K_M — Recommended for most models. ~4.7GB for a 7B model. Quality slightly better than Q4_0.
Q5_K_M — For quality-critical tasks. ~5.8GB for a 7B model.
Q8_0 — Almost unnoticeable quality loss. ~7GB for a 7B model.
-ngl 35 on macOS (Metal) or -ngl 33 on NVIDIA (CUDA) for 7B models
-t [cores]
--mlock to keep model in RAM
-b 512 for higher throughput
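Combined, the tuning flags above might look like this (values are illustrative and hardware-dependent):

```shell
# Offload layers to GPU, pin the model in RAM, match threads to physical cores,
# and raise the batch size for higher prompt-processing throughput
./main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -ngl 35 --mlock -t 8 -b 512 \
  -p "Your prompt here"
```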
Converting Safetensors/PyTorch models to GGUF format for llama.cpp:
# Install dependencies
python -m pip install gguf protobuf

# Convert HuggingFace model to GGUF (f16; quantize afterwards with the quantize tool)
python convert-hf-to-gguf.py /path/to/model \
  --outfile /path/to/output/model.gguf \
  --outtype f16
Outtypes supported by the convert script: f32, f16, bf16, q8_0. The k-quant types (q4_0, q4_1, q4_k_s, q4_k_m, q5_k_s, q5_k_m, q6_k) are produced afterwards with the separate quantize tool.
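Assuming an f16 GGUF produced by the conversion step, k-quant files can then be generated with the bundled quantize tool (paths are placeholders):

```shell
# Quantize an f16 GGUF down to Q4_K_M
# (the binary is ./quantize in older builds, ./llama-quantize in newer ones)
./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```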
Pre-converted GGUF models ready to use.
The most popular source for quantized GGUF models. Hundreds of models including Llama 2, Mistral, CodeLlama, and more.
View on HuggingFace
Research-focused models including Hermes, Synthia, and other fine-tuned versions with GGUF support.
View on HuggingFace
Specialized in large model variants (70B+) and unique quantizations. Great for GPU cloud inference.
View on HuggingFace
mistral-7b-instruct
Fast, great quality
llama-2-7b/13b-chat
All-purpose, balanced
codellama-7b/13b
Code generation
neural-chat-7b
Conversations
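One way to fetch a single quantized file from HuggingFace rather than cloning a whole repo (the repo and file names below are examples; check the model page for exact names):

```shell
# Download one GGUF file into ./models using the HuggingFace CLI
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir models
```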
Unlock the full potential of llama.cpp.
Use a smaller draft model to speed up token generation. Can achieve 2-3x speedup on supported hardware.
./speculative -m large_model.gguf -md small_model.gguf -ngl 35 --draft 10
Force JSON output or specific formats using GBNF grammar files. Perfect for structured output.
./main -m model.gguf --grammar-file json.gbnf -p "Generate JSON:"
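llama.cpp ships sample grammars in the repo (e.g. grammars/json.gbnf). As a minimal illustration, a grammar restricting output to a yes/no answer could be written and used like this:

```shell
# Minimal GBNF grammar: the model may only emit "yes" or "no"
cat > yesno.gbnf <<'EOF'
root ::= ("yes" | "no")
EOF

./main -m model.gguf --grammar-file yesno.gbnf -p "Is the sky blue? Answer: "
```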
Process multiple prompts simultaneously in server mode for higher throughput in production environments.
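A sketch of enabling parallel request slots with continuous batching in server mode (the -np and -cb flags; verify against your build's ./server --help):

```shell
# 4 parallel request slots sharing the 4096-token context, with continuous batching
./server -m models/model.gguf -c 4096 -np 4 -cb --port 8080
```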
Load LoRA adapters on top of base models without merging. Hot-swap adapters at runtime.
./main -m base.gguf --lora-scaled adapter.bin 0.8
Common issues and solutions.
Error: CUDA out of memory
Reduce GPU layers or use a smaller model. Try -ngl 20 instead of -ngl 35, or use a Q4_K_M quantized model instead of Q5.
Model is running on CPU instead of GPU.
Ensure you built with CUDA/Metal support. Check nvidia-smi or Activity Monitor to verify GPU usage. Increase -ngl to offload more layers.
Error: invalid magic
Your llama.cpp version is too old for this GGUF file. Pull latest changes and rebuild, or download an older GGUF version (v1 or v2).
Random characters or nonsensical output.
Usually indicates a wrong tokenizer or an incompatible model. Ensure you're using the correct prompt template for the model (e.g., the [INST] ... [/INST] format for Mistral Instruct, or ChatML for models trained on it).
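For example, a Mistral Instruct model expects the user turn wrapped in its instruction tags (model filename is illustrative):

```shell
# Mistral Instruct prompt template: wrap the user message in [INST] ... [/INST]
./main -m mistral-7b-instruct.Q4_K_M.gguf \
  -p "[INST] Write a haiku about autumn. [/INST]"
```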
Get help and stay updated.