# 🦙 Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)

[![Documentation Status](https://readthedocs.org/projects/llama-cpp-python/badge/?version=latest)](https://llama-cpp-python.readthedocs.io/en/latest/?badge=latest)
[![Tests](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml)
[![PyPI](https://img.shields.io/pypi/v/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - License](https://img.shields.io/pypi/l/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)

Simple Python bindings for **@ggerganov's** [`llama.cpp`](https://github.com/ggerganov/llama.cpp) library.
This package provides:

- Low-level access to C API via `ctypes` interface.
- High-level Python API for text completion
    - OpenAI-like API
    - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)
- OpenAI compatible web server
    - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
    - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
    - [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)

Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).


## Installation from PyPI

Install from PyPI (requires a C compiler):

```bash
pip install llama-cpp-python
```

The above command will attempt to install the package and build `llama.cpp` from source.
This is the recommended installation method as it ensures that `llama.cpp` is built with the available optimizations for your system.

If you have previously installed `llama-cpp-python` through pip and want to upgrade your version or rebuild the package with different compiler options, add the following flags to ensure that the package is rebuilt correctly:

```bash
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```

Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:
```
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```
Otherwise, the installation will build the x86 version of `llama.cpp`, which will be 10x slower on an Apple Silicon (M1) Mac.
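
To confirm which architecture your interpreter targets before installing, you can ask Python itself. This is a minimal sketch; the helper name `is_arm64_python` is hypothetical and not part of `llama-cpp-python`.

```python
import platform


def is_arm64_python(machine=None):
    """Return True if the interpreter reports an arm64 architecture.

    `machine` defaults to platform.machine(); it is a parameter only so the
    check can be exercised on any machine.
    """
    machine = platform.machine() if machine is None else machine
    # Native Apple Silicon builds report "arm64"; x86 builds running under
    # Rosetta report "x86_64".
    return machine.lower() in ("arm64", "aarch64")


print(platform.machine(), is_arm64_python())
```

If this prints `x86_64` on an Apple Silicon Mac, the interpreter is most likely an x86 build and the package would be compiled for x86 as well.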

### Installation with Hardware Acceleration

`llama.cpp` supports multiple BLAS backends for faster processing.

To install with OpenBLAS, pass the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` CMake options through the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

To install with cuBLAS, pass the `LLAMA_CUBLAS=on` CMake option through the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```

To install with CLBlast, pass the `LLAMA_CLBLAST=on` CMake option through the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
```

To install with Metal (MPS), pass the `LLAMA_METAL=on` CMake option through the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

To install with hipBLAS / ROCm support for AMD cards, pass the `LLAMA_HIPBLAS=on` CMake option through the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```
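
The same pattern can be scripted so that it works the same way in any shell. The sketch below only builds the command and environment; it does not run the install. The `BACKEND_FLAGS` keys and the `install_command` helper are hypothetical names chosen for this example, and the flag strings are copied from the commands above.

```python
import os
import sys

# CMake flags copied from the commands above, keyed by a short backend label.
# The labels are hypothetical and exist only for this sketch.
BACKEND_FLAGS = {
    "openblas": "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS",
    "cublas": "-DLLAMA_CUBLAS=on",
    "clblast": "-DLLAMA_CLBLAST=on",
    "metal": "-DLLAMA_METAL=on",
    "hipblas": "-DLLAMA_HIPBLAS=on",
}


def install_command(backend, base_env=None):
    """Return (argv, env) for a pip install that builds with `backend`."""
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = BACKEND_FLAGS[backend]
    argv = [sys.executable, "-m", "pip", "install", "llama-cpp-python"]
    return argv, env


argv, env = install_command("cublas")
print("CMAKE_ARGS=" + env["CMAKE_ARGS"], " ".join(argv[1:]))

# To actually run the install:
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Using `sys.executable -m pip` keeps the install tied to the interpreter that runs the script, which matters when several Python versions are present.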

#### Windows remarks

To set `CMAKE_ARGS` in PowerShell, use the following steps (example using OpenBLAS):

```ps
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
```

Then, call `pip` after setting the variables:
```
pip install llama-cpp-python
```

If the build complains that it can't find `nmake` or `CMAKE_C_COMPILER`, you can extract w64devkit as [mentioned in the llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add the compiler paths manually to `CMAKE_ARGS` before running `pip install`:
```ps
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```

See the instructions above and set `CMAKE_ARGS` for the BLAS backend you want to use.

#### macOS remarks

Detailed macOS Metal GPU install documentation is available at [docs/install/macos.md](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).

## High-level API

The high-level API provides a simple managed interface through the `Llama` class.

Below is a short example demonstrating how to use the high-level API to generate text:

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```
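
Since the result is a plain dictionary, the generated text can be pulled out with ordinary indexing. The sketch below works on a dictionary copied from the output above, so it runs without a model; the `first_completion` helper is a hypothetical name, not part of the library.

```python
# A response shaped like the one printed above (some fields omitted).
output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, "
                    "Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}


def first_completion(response, prompt=""):
    """Return the text of the first choice, minus an echoed prompt if present."""
    text = response["choices"][0]["text"]
    # With echo=True the prompt is repeated at the start of the text.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "Q: Name the planets in the solar system? A: "
print(first_completion(output, prompt))
```

Passing `echo=False` (or omitting `echo`) should make the prompt-stripping step unnecessary, but handling both cases keeps the helper safe to reuse.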

### Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.

For instance, to work with larger contexts, you can expand the context window by setting the `n_ctx` parameter when initializing the `Llama` object:

```python
llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
```
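
A quick way to pick `n_ctx` is to check the token budget up front. The sketch below assumes that the prompt and the generated tokens share the same window; `fits_in_context` is a hypothetical helper, and the 1800-token prompt is an illustrative number only.

```python
def fits_in_context(prompt_tokens, max_tokens, n_ctx=512):
    """Return True if the prompt plus the requested completion fits in n_ctx."""
    return prompt_tokens + max_tokens <= n_ctx


# Numbers from the example above: a 14-token prompt with max_tokens=32.
print(fits_in_context(14, 32))  # 14 + 32 = 46, within the default of 512

# A longer, illustrative prompt needs a larger window.
print(fits_in_context(1800, 200))              # 2000 exceeds 512
print(fits_in_context(1800, 200, n_ctx=2048))  # 2000 fits in 2048
```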


## Web Server

`llama-cpp-python` offers a web server which aims to act as a drop-in replacement for the OpenAI API.
This allows you to use llama.cpp-compatible models with any OpenAI-compatible client (language libraries, services, etc.).

To install the server package and get started:

```bash
pip install llama-cpp-python[server]
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
```

As in the Hardware Acceleration section above, you can also install with GPU (cuBLAS) support:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.

To bind to `0.0.0.0` to enable remote connections, use `python3 -m llama_cpp.server --host 0.0.0.0`.
Similarly, to change the port (default is 8000), use `--port`.
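
Once the server is running, any HTTP client can talk to it. The sketch below builds a completion request with the standard library only and does not send it. The `/v1/completions` path is assumed from the OpenAI API that the server aims to mirror, and `build_completion_request` is a hypothetical helper; check the OpenAPI page at `/docs` for the routes your version exposes.

```python
import json
import urllib.request


def build_completion_request(prompt, host="localhost", port=8000, max_tokens=32):
    """Build (but do not send) a POST request for the completions endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/v1/completions",  # assumed OpenAI-style route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("Q: Name the planets in the solar system? A: ")
print(req.get_method(), req.full_url)

# With the server running, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```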

## Docker image

A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:

```bash
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```

[Docker on Termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones; see the [Termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389).

## Low-level API

The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

```python
>>> import llama_cpp
>>> import ctypes
>>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
>>> ctx = llama_cpp.llama_new_context_with_model(model, params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
>>> llama_cpp.llama_free(ctx)
```

Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.
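
The two `ctypes` conventions used above (bytes for `char *` parameters, fixed-size arrays for output buffers) can be tried without loading a model. This sketch uses `ctypes.c_int32` as a stand-in for `llama_cpp.llama_token`, which is an assumption, and the token ids are made-up placeholders rather than real tokenizer output.

```python
import ctypes

# Stand-in for llama_cpp.llama_token, assumed here to be a 32-bit integer.
token_t = ctypes.c_int32

max_tokens = 8
# A fixed-size, zero-initialised C array, built the same way as `tokens` above.
tokens = (token_t * max_tokens)()

# char * parameters take bytes, not str.
prompt = "Q: Name the planets in the solar system? A: ".encode("utf-8")

# Pretend a tokenizer wrote three ids into the buffer and returned the count.
placeholder_ids = [101, 102, 103]
for i, tok in enumerate(placeholder_ids):
    tokens[i] = tok
n_tokens = len(placeholder_ids)

# Only the first n_tokens entries are meaningful; slicing yields a Python list.
print(list(tokens[:n_tokens]))
```

Slicing by the returned count matters because the rest of the buffer is still zero-filled and would otherwise be read as tokens.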


# Documentation

Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).
If you find any issues with the documentation, please open an issue or submit a PR.

# Development

This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in editable / development mode:

```bash
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e .[server]

# to install all optional dependencies
pip install -e .[all]

# to clear the local build cache
make clean
```

# How does this compare to other Python bindings of `llama.cpp`?

I originally wrote this package for my own use with two goals in mind:

- Provide a simple process to install `llama.cpp` and access the full C API in `llama.h` from Python
- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use `llama.cpp`

Any contributions and changes to this package will be made with these goals in mind.

# License

This project is licensed under the terms of the MIT license.