🦙 Python Bindings for llama.cpp

Simple Python bindings for @ggerganov's llama.cpp library. This package provides:

  • Low-level access to C API via ctypes interface.
  • High-level Python API for text completion
    • OpenAI-like API
    • LangChain compatibility

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.

Install from PyPI (requires a C compiler):

pip install llama-cpp-python

The above command will attempt to install the package and build llama.cpp from source. This is the recommended installation method as it ensures that llama.cpp is built with the available optimizations for your system.

If you have previously installed llama-cpp-python through pip and want to upgrade your version or rebuild the package with different compiler options, please add the following flags to ensure that the package is rebuilt correctly:

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh

Otherwise, the installation will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
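
Before installing, it can help to confirm which architecture the active Python interpreter reports. The following is a minimal sketch using only the standard library; the helper name is_arm64_python is hypothetical and not part of this package. An x86 Python running under Rosetta reports x86_64 even on Apple Silicon hardware.

```python
import platform
import sys


def is_arm64_python() -> bool:
    """Return True when the running interpreter reports an arm64 machine type."""
    # macOS reports "arm64"; Linux on ARM usually reports "aarch64".
    return platform.machine().lower() in ("arm64", "aarch64")


if __name__ == "__main__":
    print(f"Python {sys.version.split()[0]} on {platform.machine()}")
    if sys.platform == "darwin" and not is_arm64_python():
        # Hypothetical warning text; adjust to taste.
        print("This Python is not an arm64 build; llama.cpp would be built for x86.")
```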

Installation with OpenBLAS / cuBLAS / CLBlast / Metal

llama.cpp supports multiple BLAS backends for faster processing. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package for the desired BLAS backend.

To install with OpenBLAS, pass -DLLAMA_OPENBLAS=on through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

To install with cuBLAS, pass -DLLAMA_CUBLAS=on through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

To install with CLBlast, pass -DLLAMA_CLBLAST=on through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python

To install with Metal (MPS), pass -DLLAMA_METAL=on through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

Detailed macOS Metal GPU install documentation is available at docs/macos_install.md
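
If you drive the installation from a script, the same environment variables can be assembled in Python. This is a sketch only: the names BACKEND_FLAGS, build_env and pip_command are hypothetical helpers, and the flags are copied from the commands above rather than checked against your llama.cpp version.

```python
import os
import sys

# CMake flags copied from the install commands in this README.
BACKEND_FLAGS = {
    "openblas": "-DLLAMA_OPENBLAS=on",
    "cublas": "-DLLAMA_CUBLAS=on",
    "clblast": "-DLLAMA_CLBLAST=on",
    "metal": "-DLLAMA_METAL=on",
}


def build_env(backend, base_env=None):
    """Return a copy of the environment with CMAKE_ARGS and FORCE_CMAKE set."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = BACKEND_FLAGS[backend]
    env["FORCE_CMAKE"] = "1"
    return env


def pip_command():
    """Return the pip invocation that forces a rebuild of the package."""
    return [
        sys.executable, "-m", "pip", "install", "llama-cpp-python",
        "--force-reinstall", "--upgrade", "--no-cache-dir",
    ]
```

To run the install, pass both to the standard library, for example subprocess.run(pip_command(), env=build_env("cublas"), check=True).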

High-level API

The high-level API provides a simple managed interface through the Llama class.

Below is a short example demonstrating how to use the high-level API to generate text:

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/ggml-model.bin")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/ggml-model.bin",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
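
The returned value is a plain dictionary, so the generated text and token counts can be read with ordinary indexing. The sketch below works on a trimmed copy of the sample output above, so it runs without a model. The helper name answer_only is hypothetical. Because echo=True was used, the prompt is part of the returned text and is stripped here.

```python
# Trimmed copy of the sample response printed above.
output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, "
                    "Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}


def answer_only(response, prompt):
    """Return the first choice's text with the echoed prompt removed, if present."""
    text = response["choices"][0]["text"]
    return text[len(prompt):] if text.startswith(prompt) else text


prompt = "Q: Name the planets in the solar system? A: "
answer = answer_only(output, prompt)
usage = output["usage"]
print(answer)
print(usage["prompt_tokens"] + usage["completion_tokens"], "tokens in total")
```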

Web Server

llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp-compatible models with any OpenAI-compatible client (language libraries, services, etc.).

To install the server package and get started:

pip install llama-cpp-python[server]
python3 -m llama_cpp.server --model models/7B/ggml-model.bin

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
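
Once the server is running, any HTTP client can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server exposes an OpenAI-style /v1/completions route on the default port; check the /docs page of your running server for the exact routes and parameters. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=32):
    """Build a POST request for an OpenAI-style completion (route is an assumption)."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "stop": ["Q:", "\n"]}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started as shown above, complete("Q: Name the planets in the solar system? A: ") should return a dictionary shaped like the high-level API output.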

Docker image

A Docker image is available on GHCR. To run the server:

docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest

Low-level API

The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

>>> import llama_cpp
>>> import ctypes
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
>>> ctx = llama_cpp.llama_init_from_file(b"./models/7B/ggml-model.bin", params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
>>> llama_cpp.llama_free(ctx)
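
The tokenizer writes into the preallocated ctypes array and returns how many entries it filled, so only the first n_tokens entries are meaningful. The sketch below shows that pattern with ctypes alone, so it runs without a model. It assumes llama_token is a C int, and fake_tokenize is a hypothetical stand-in for llama_tokenize that writes placeholder ids, not real tokens.

```python
import ctypes

# Assumption: mirrors `typedef int llama_token` in llama.h.
llama_token = ctypes.c_int

max_tokens = 8
# Calling the array type allocates a zero-initialised buffer of max_tokens entries.
tokens = (llama_token * max_tokens)()


def fake_tokenize(buffer, capacity):
    """Stand-in for llama_tokenize: fill the buffer and return the count written."""
    placeholder_ids = [1, 100, 200]  # made-up values for illustration only
    count = min(len(placeholder_ids), capacity)
    for i in range(count):
        buffer[i] = placeholder_ids[i]
    return count


n_tokens = fake_tokenize(tokens, max_tokens)
# Slicing a ctypes array yields a regular Python list of ints.
token_ids = list(tokens[:n_tokens])
print(token_ids)
```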

Check out the examples folder for more examples of using the low-level API.

Documentation

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest. If you find any issues with the documentation, please open an issue or submit a PR.

Development

This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in development mode:

git clone --recurse-submodules git@github.com:abetlen/llama-cpp-python.git

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e .[server]

# If you're a poetry user, installing will also include a virtual environment
poetry install --all-extras
. .venv/bin/activate

# Will need to be re-run any time vendor/llama.cpp is updated
python3 setup.py develop

How does this compare to other Python bindings of llama.cpp?

I originally wrote this package for my own use with two goals in mind:

  • Provide a simple process to install llama.cpp and access the full C API in llama.h from Python
  • Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use llama.cpp

Any contributions and changes to this package will be made with these goals in mind.

License

This project is licensed under the terms of the MIT license.