# 🦙 Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)
[![Documentation Status](https://readthedocs.org/projects/llama-cpp-python/badge/?version=latest)](https://llama-cpp-python.readthedocs.io/en/latest/?badge=latest)
[![Tests](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml)
[![PyPI](https://img.shields.io/pypi/v/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - License](https://img.shields.io/pypi/l/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)

Simple Python bindings for **@ggerganov's** [`llama.cpp`](https://github.com/ggerganov/llama.cpp) library.
This package provides:

- Low-level access to C API via `ctypes` interface.
- High-level Python API for text completion
    - OpenAI-like API
    - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)
    - [LlamaIndex compatibility](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html)
- OpenAI compatible web server
    - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
    - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
    - [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)
    - [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)

Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).

## Installation
`llama-cpp-python` can be installed directly from PyPI as a source distribution by running:
```bash
pip install llama-cpp-python
```

This will build `llama.cpp` from source using CMake and your system's C compiler (required) and install the library alongside this Python package.

If you run into issues during installation, add the `--verbose` flag to the `pip install` command to see the full CMake build log.

### Installation with Specific Hardware Acceleration (BLAS, CUDA, Metal, etc)
The default pip install behaviour is to build `llama.cpp` for CPU only on Linux and Windows and use Metal on MacOS.
`llama.cpp` supports a number of hardware acceleration backends, including OpenBLAS, cuBLAS, CLBlast, hipBLAS, and Metal.
See the [llama.cpp README](https://github.com/ggerganov/llama.cpp#build) for a full list of supported backends.
All of these backends are supported by `llama-cpp-python` and can be enabled by setting the `CMAKE_ARGS` environment variable before installing.
On Linux and Mac you can set `CMAKE_ARGS` like this:
```bash
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

On Windows you can set `CMAKE_ARGS` like this:

```ps
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python
```
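
For scripted installs, the same environment variable can also be set from Python. The sketch below is only an illustration: `build_cmake_args` and `install_with_backend` are hypothetical helper names, not part of `llama-cpp-python`. It assembles the `-D` options shown above into a `CMAKE_ARGS` string and passes it to `pip` through the process environment.

```python
import os
import subprocess
import sys


def build_cmake_args(options):
    """Join CMake options into a CMAKE_ARGS string (hypothetical helper)."""
    return " ".join(f"-D{name}={value}" for name, value in options.items())


def install_with_backend(options):
    """Run `pip install llama-cpp-python` with CMAKE_ARGS set for the child process only."""
    env = dict(os.environ, CMAKE_ARGS=build_cmake_args(options))
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "llama-cpp-python"],
        env=env,
        check=True,
    )


# Build the same string as the OpenBLAS example above (nothing is installed here).
openblas_args = build_cmake_args({"LLAMA_BLAS": "ON", "LLAMA_BLAS_VENDOR": "OpenBLAS"})
print(openblas_args)
```

Calling `install_with_backend({...})` would start the actual build, so it is left out of the snippet.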
#### OpenBLAS
To install with OpenBLAS, set the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` CMake options through the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

#### cuBLAS
To install with cuBLAS, set the `LLAMA_CUBLAS=on` option in the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```

#### Metal
To install with Metal (MPS), set the `LLAMA_METAL=on` option in the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

#### CLBlast
To install with CLBlast, set the `LLAMA_CLBLAST=on` option in the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
```

#### hipBLAS
To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on` option in the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```

#### Vulkan
To install with Vulkan support, set the `LLAMA_VULKAN=on` option in the `CMAKE_ARGS` environment variable before installing:
```bash
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
```
#### Kompute
To install with Kompute support, set the `LLAMA_KOMPUTE=on` option in the `CMAKE_ARGS` environment variable before installing:
```bash
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
```
#### SYCL
To install with SYCL support, load the oneAPI environment and set the `LLAMA_SYCL=on` option in the `CMAKE_ARGS` environment variable before installing:

```bash
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```

### Windows Notes
If the build complains that it can't find `'nmake'`, `'?'`, or `CMAKE_C_COMPILER`, you can extract w64devkit as [mentioned in the llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add the compiler paths manually to `CMAKE_ARGS` before running `pip install`:

```ps
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```

See the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.

### MacOS Notes
Detailed MacOS Metal GPU install documentation is available at [docs/install/macos.md](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).

#### M1 Mac Performance Issue

Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

Otherwise, the installation will build the x86 version of `llama.cpp`, which will be 10x slower on an Apple Silicon (M1) Mac.

#### M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
Try installing with:

```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```

### Upgrading and Reinstalling
To upgrade or rebuild `llama-cpp-python`, add the following flags to the `pip install` command to ensure that the package is rebuilt correctly:
```bash
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```
This will ensure that all source files are re-built with the most recently set `CMAKE_ARGS` flags.
## High-level API
[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api)

The high-level API provides a simple managed interface through the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

```python
>>> from llama_cpp import Llama
>>> llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```

Text completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
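
The completion result is a plain dictionary, so the generated text and token counts can be read with ordinary indexing. The following is a minimal sketch that works on a hand-written dictionary shaped like the output printed above; `completion_text` and `token_usage` are hypothetical helper names, not part of the library.

```python
def completion_text(output):
    """Return the text of the first choice in a completion response."""
    return output["choices"][0]["text"]


def token_usage(output):
    """Return (prompt_tokens, completion_tokens, total_tokens) from a completion response."""
    usage = output["usage"]
    return usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"]


# Hand-written sample that mirrors the structure of the output shown above.
sample_output = {
    "object": "text_completion",
    "choices": [
        {"text": "Q: ... A: Mercury, Venus, Earth", "index": 0, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}

text = completion_text(sample_output)
prompt_tokens, completion_tokens, total_tokens = token_usage(sample_output)
```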
## Pulling models from Hugging Face
You can pull `Llama` models from Hugging Face using the `from_pretrained` method.
You'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).
```python
from llama_cpp import Llama

llama = Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)
```
### Chat Completion
The high-level API also provides a simple interface for chat completion.
Note that the `chat_format` option must be set for the particular model you are using.
```python
>>> from llama_cpp import Llama
>>> llm = Llama(
      model_path="path/to/llama-2/llama-model.gguf",
      chat_format="llama-2"
)
>>> llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)
```

Chat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
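
To read the reply, the sketch below assumes the returned dictionary follows the OpenAI chat completion layout, with the assistant message under `choices[0]["message"]`. That layout is an assumption to check against the API reference; the sample response is hand-written and `make_messages` / `assistant_reply` are hypothetical helper names.

```python
def make_messages(system_prompt, user_prompt):
    """Build the `messages` list expected by create_chat_completion."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def assistant_reply(response):
    """Return the assistant's text, assuming an OpenAI-style response dictionary."""
    return response["choices"][0]["message"]["content"]


messages = make_messages("You are a helpful assistant.", "Say hello.")

# Hand-written stand-in for what create_chat_completion is assumed to return.
sample_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
}

reply = assistant_reply(sample_response)
```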
### JSON and JSON Schema Mode
If you want to constrain chat responses to only valid JSON or a specific JSON Schema, you can use the `response_format` argument to the `create_chat_completion` method.

#### JSON Mode

The following example will constrain the response to be valid JSON.

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
>>> llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)
```

#### JSON Schema Mode

To constrain the response to a specific JSON Schema, you can use the `schema` property of the `response_format` argument.

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
>>> llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
```
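
Because the message content arrives as a JSON string, a consumer would typically decode it before use. The sketch below decodes a hand-written sample string (not real model output) and checks the one field required by the schema above; `parse_team` is a hypothetical helper name.

```python
import json


def parse_team(content):
    """Decode a JSON reply and return `team_name`, checking the field the schema requires."""
    data = json.loads(content)
    if not isinstance(data, dict) or not isinstance(data.get("team_name"), str):
        raise ValueError("reply does not match the expected schema")
    return data["team_name"]


# Hand-written sample of a schema-conforming reply.
sample_content = '{"team_name": "Los Angeles Dodgers"}'
team = parse_team(sample_content)
```

Keeping the check even in schema mode is a cheap safeguard when the same code path may also receive replies produced without a schema.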
### Function Calling
The high-level API also provides a simple interface for function calling. This is possible through the `functionary` pre-trained models chat format or through the generic `chatml-function-calling` chat format.

The gguf-converted files for functionary can be found here: [functionary-7b-v1](https://huggingface.co/abetlen/functionary-7b-v1-GGUF)

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="path/to/functionary/llama-model.gguf", chat_format="functionary")
>>> # or
>>> llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
>>> llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
        },
        {
          "role": "user",
          "content": "Extract Jason is 25 years old"
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "UserDetail",
          "parameters": {
            "type": "object",
            "title": "UserDetail",
            "properties": {
              "name": {
                "title": "Name",
                "type": "string"
              },
              "age": {
                "title": "Age",
                "type": "integer"
              }
            },
            "required": [ "name", "age" ]
          }
        }
      }],
      # tool_choice is a single object naming the function to call
      tool_choice={
        "type": "function",
        "function": {
          "name": "UserDetail"
        }
      }
)
```
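
A caller then needs to pull the function name and arguments out of the reply. The sketch below assumes the response follows the OpenAI tool-calling layout, where `message["tool_calls"][i]["function"]["arguments"]` is a JSON-encoded string; verify this against the output of your installed version. The sample response is hand-written and `extract_tool_call` is a hypothetical helper name.

```python
import json


def extract_tool_call(response):
    """Return (function_name, arguments_dict) for the first tool call in the reply."""
    message = response["choices"][0]["message"]
    function = message["tool_calls"][0]["function"]
    return function["name"], json.loads(function["arguments"])


# Hand-written stand-in for an OpenAI-style tool-calling response.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            }
        }
    ]
}

name, arguments = extract_tool_call(sample_response)
```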
### Multi-modal Models
`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.
```python
>>> from llama_cpp import Llama
>>> from llama_cpp.llama_chat_format import Llava15ChatHandler
>>> chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
>>> llm = Llama(
  model_path="./path/to/llava/llama-model.gguf",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
  logits_all=True, # needed to make llava work
)
>>> llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://.../image.png"}},
                {"type" : "text", "text": "Describe this image in detail please."}
            ]
        }
    ]
)
```
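
The example above references a remote image. For a local file, one common approach with OpenAI-style vision messages is to embed the image as a base64 `data:` URI in the `url` field. Whether the chat handler accepts such URIs is an assumption here and should be checked for your installed version; `image_bytes_to_data_uri` is a hypothetical helper name.

```python
import base64


def image_bytes_to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes; in practice these would come from open("image.png", "rb").read().
data_uri = image_bytes_to_data_uri(b"hello")

# The URI would take the place of the remote URL in the message content.
image_part = {"type": "image_url", "image_url": {"url": data_uri}}
```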
### Speculative Decoding
`llama-cpp-python` supports speculative decoding which allows the model to generate completions based on a draft model.
The fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.
Just pass this as a draft model to the `Llama` class during initialization.
```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better for CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)
```
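
To give a feel for what a prompt-lookup draft does, here is a simplified pure-Python illustration of the general idea: find an earlier occurrence of the most recent tokens and propose whatever followed it as the draft. This is a sketch of the technique only, not the implementation of `LlamaPromptLookupDecoding`; the function name and the `max_ngram_size` parameter are assumptions made for the example.

```python
def prompt_lookup_draft(tokens, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for ngram_size in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-ngram_size:]
        # Scan earlier start positions from most recent to oldest,
        # skipping the position of the suffix itself.
        for start in range(len(tokens) - ngram_size - 1, -1, -1):
            if tokens[start:start + ngram_size] == suffix:
                end = start + ngram_size
                following = tokens[end:end + num_pred_tokens]
                if following:
                    return following
    return []  # no earlier match, so nothing to propose


# The sequence ends with [1, 2], which also appears at the start,
# so the tokens that followed it there are proposed as the draft.
draft = prompt_lookup_draft([1, 2, 3, 4, 5, 1, 2], num_pred_tokens=3)
```

The main model then verifies the proposed tokens, which is why a cheap lookup like this can speed up generation on repetitive text.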
### Embeddings
`llama-cpp-python` supports generating embeddings from text.
```python
import llama_cpp
llm = llama_cpp.Llama(model_path="path/to/model.gguf", embeddings=True)
embeddings = llm.create_embedding("Hello, world!")
# or batched
embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
```
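
Embeddings are usually compared with cosine similarity. The sketch below assumes the result of `create_embedding` follows the OpenAI layout, with each vector under `data[i]["embedding"]`; that layout is an assumption, and the sample response uses tiny hand-written vectors rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written stand-in for a batched embedding response.
sample_response = {
    "object": "list",
    "data": [
        {"index": 0, "embedding": [1.0, 0.0]},
        {"index": 1, "embedding": [1.0, 1.0]},
    ],
}

vectors = [item["embedding"] for item in sample_response["data"]]
similarity = cosine_similarity(vectors[0], vectors[1])
```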
### Adjusting the Context Window
The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.
For instance, if you want to work with larger contexts, you can expand the context window by setting the `n_ctx` parameter when initializing the `Llama` object:
```python
llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
```
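
The prompt and the generated tokens share the same window, so a long prompt leaves less room for the completion. A small sketch of that arithmetic, using a hypothetical helper name (`max_new_tokens`) and plain integers rather than a loaded model:

```python
def max_new_tokens(n_ctx, prompt_tokens, requested=None):
    """Tokens that can still be generated once the prompt occupies part of the window."""
    remaining = n_ctx - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    # With no explicit request, generation can run to the end of the window.
    return remaining if requested is None else min(requested, remaining)


budget_default = max_new_tokens(512, 14, requested=32)   # short prompt, full request fits
budget_tight = max_new_tokens(512, 500, requested=32)    # long prompt, request is clipped
budget_large = max_new_tokens(2048, 14)                  # larger window, no explicit request
```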
## OpenAI Compatible Web Server
`llama-cpp-python` offers a web server which aims to act as a drop-in replacement for the OpenAI API.
This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).
To install the server package and get started:
```bash
pip install "llama-cpp-python[server]"
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
```

Similar to the Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install "llama-cpp-python[server]"
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.

To bind to `0.0.0.0` to enable remote connections, use `python3 -m llama_cpp.server --host 0.0.0.0`.
Similarly, to change the port (default is 8000), use `--port`.

You probably also want to set the prompt format. For chatml, use:

```bash
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml
```
That will format the prompt according to how the model expects it. You can find the prompt format in the model card.
For possible options, see [llama_cpp/llama_chat_format.py](llama_cpp/llama_chat_format.py) and look for lines starting with "@register_chat_format".
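
Once the server is running, any HTTP client can talk to it. The sketch below uses only the standard library and assumes the server exposes the OpenAI-style `/v1/chat/completions` route on the default port; the helper names are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, **params):
    """Build a POST request for an OpenAI-style chat completions endpoint."""
    payload = dict(params, messages=messages)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    "http://localhost:8000",
    [{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
```

Calling `send(request)` against a running server would return the decoded response dictionary.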
### Web Server Features
- [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
- [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
- [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)
- [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)

## Docker image
A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:

```bash
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```

[Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones; see the [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389).

## Low-level API
[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#low-level-api)

The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

```python
>>> import llama_cpp
>>> import ctypes
>>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
>>> ctx = llama_cpp.llama_new_context_with_model(model, params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
>>> llama_cpp.llama_free(ctx)
```
Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.
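
The two `ctypes` conventions used in the example (bytes for `char *` parameters, array types for buffer parameters) can be tried out without loading a model. In the sketch below, `ctypes.c_int32` is used as a stand-in for `llama_cpp.llama_token`; that mapping is an assumption for illustration only.

```python
import ctypes

# `char *` parameters take bytes, so text is encoded first.
prompt = "Q: Name the planets in the solar system? A: ".encode("utf-8")
c_prompt = ctypes.c_char_p(prompt)

# Array parameters are created by multiplying an element type by a length
# and instantiating the resulting array type (zero-initialised).
TokenArray = ctypes.c_int32 * 8
tokens = TokenArray()

# A C function would fill the buffer; here one slot is set by hand.
tokens[0] = 1
first_three = list(tokens[:3])
```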
## Documentation
Documentation is available via [https://llama-cpp-python.readthedocs.io/](https://llama-cpp-python.readthedocs.io/).
If you find any issues with the documentation, please open an issue or submit a PR.

## Development
This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in editable / development mode:

```bash
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e ".[server]"

# to install all optional dependencies
pip install -e ".[all]"

# to clear the local build cache
make clean
```

You can also test out specific commits of `llama.cpp` by checking out the desired commit in the `vendor/llama.cpp` submodule and then running `make clean` and `pip install -e .` again. Any changes in the `llama.h` API will require
changes to the `llama_cpp/llama_cpp.py` file to match the new API (additional changes may be required elsewhere).

## FAQ
### Are there pre-built binaries / binary wheels available?
The recommended installation method is to install from source as described above.
The reason for this is that `llama.cpp` is built with compiler optimizations that are specific to your system.
Using pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.
That being said, there are some pre-built binaries available through the Releases as well as some community-provided wheels.
In the future, I would like to provide pre-built binaries and wheels for common platforms and I'm happy to accept any useful contributions in this area.
This is currently being tracked in [#741](https://github.com/abetlen/llama-cpp-python/issues/741).

### How does this compare to other Python bindings of `llama.cpp`?
I originally wrote this package for my own use with two goals in mind:
- Provide a simple process to install `llama.cpp` and access the full C API in `llama.h` from Python
- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use `llama.cpp`
Any contributions and changes to this package will be made with these goals in mind.
## License
This project is licensed under the terms of the MIT license.