Llama.cpp
llama-cpp-python is a Python binding for llama.cpp.
It supports inference for many LLMs, which can be accessed on Hugging Face.
This notebook goes over how to run llama-cpp-python
within LangChain.
Note: new versions of llama-cpp-python
use GGUF model files (see here).
This is a breaking change.
To convert existing GGML models to GGUF, you can run the following in llama.cpp:
python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin
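If you want to confirm that the conversion succeeded, a quick sanity check is to read the file's magic bytes: GGUF files begin with the four-byte magic b"GGUF". The sketch below is illustrative; the path simply reuses the output file from the command above.

# Minimal sanity check: GGUF files start with the 4-byte magic b"GGUF".
# The path assumes the output file from the conversion command above.
with open("models/openorca-platypus2-13b.gguf.q4_0.bin", "rb") as f:
    magic = f.read(4)

print("Valid GGUF header" if magic == b"GGUF" else f"Unexpected header: {magic!r}")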
Installation
There are different options for installing the llama-cpp-python package:
- CPU usage
- CPU + GPU (using one of many BLAS backends)
- Metal GPU (MacOS with Apple Silicon Chip)
CPU only installation
%pip install --upgrade --quiet llama-cpp-python
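After installation, you can verify that the bindings import correctly. This is a minimal check, assuming the installed package exposes its version string as llama_cpp.__version__ (true for recent releases).

# Quick check that the bindings installed correctly.
import llama_cpp

print(llama_cpp.__version__)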
Installation with OpenBLAS / cuBLAS / CLBlast
llama.cpp supports multiple BLAS backends for faster processing. Use the FORCE_CMAKE=1 environment variable to force the use of cmake, and install the pip package with the desired BLAS backend enabled (source).
Example installation with cuBLAS backend:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
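Once the package is installed, you can load a GGUF model through LangChain's LlamaCpp wrapper. The following is a minimal sketch: the model path reuses the GGUF file from the conversion step above, and the parameter values are illustrative. n_gpu_layers only takes effect with a GPU-enabled build; set it to 0 or omit it for CPU-only installs.

from langchain_community.llms import LlamaCpp

# Minimal sketch -- the model path and parameter values are illustrative.
llm = LlamaCpp(
    model_path="models/openorca-platypus2-13b.gguf.q4_0.bin",  # GGUF file from the conversion step
    n_gpu_layers=-1,  # offload all layers to the GPU; only effective with a GPU-enabled build
    n_ctx=2048,       # context window size
    verbose=True,     # logs llama.cpp build/offload details at load time
)

print(llm.invoke("Q: Name the planets in the solar system. A:"))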