Compare commits: 2264fbf750...da343412ee

33 commits (newest first; author and date columns were empty in this capture):

- da343412ee
- 427d816ebf
- 52d9d70076
- 251a8a2cad
- eebb102df7
- 5f96621e92
- b9aca612af
- a0ce429dc0
- 410e02da51
- eb56ce2e2a
- 0f8cad6cb7
- 045cc12670
- e10af30cf1
- 3561ebf536
- 32efed7b07
- d80c5cf29d
- aefcb8f71a
- 3921e10770
- e6d6260a91
- dd22010e85
- 3632241e98
- 0653e15c20
- 7981e9ce1e
- 7f3962e11c
- 14191e9036
- fe5626cd40
- 7f51b6071f
- 0f8aa4ab5c
- e42f62c247
- 4edde21b3d
- f57b01ac9b
- 04fe33b999
- d122bd7858
15 changed files with 1610 additions and 1160 deletions
.gitignore (vendored, 2 changes)

@@ -1,3 +1,5 @@
+*.local
+
 .python-version
 
 .vscode/
CHANGELOG.md (23 changes)

@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]
 
+## [0.2.48]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@15499eb94227401bdc8875da6eb85c15d37068f7
+- feat: Add Google's Gemma formatting via chat_format="gemma" by @alvarobartt in #1210
+- feat: support minItems/maxItems in JSON grammar converter by @nopperl in 3921e10770996d95a9eb22c8248bacef39f69365
+- fix: Update from_pretrained defaults to match hf_hub_download and pull to local cache folder by @abetlen in e6d6260a91b7831733f7d1f73c7af46a3e8185ed
+- fix: Raise exceptions when llama model or context fails to load by @abetlen in dd22010e85265ae840c76ec835d67a29ed852722
+- docs: Update README.md to fix pip install llama cpp server by @audip in #1187
+
+## [0.2.47]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@973053d8b0d04809836b3339a50f68d9c842de90
+
+## [0.2.46]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@ba2135ccae7462470b3865c6e41d2e1d734eac05
+- feat: Pull models directly from huggingface by @abetlen in #1206
+- feat(low-level-api): Improve API static type-safety and performance. Low level api functions are positional args only now. by @abetlen in #1205
+
+## [0.2.45]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@89febfed9322c8849520dc63c93ee4f5fd72556e
+
 ## [0.2.44]
 
 - feat: Update llama.cpp to ggerganov/llama.cpp@4524290e87b8e107cc2b56e1251751546f4b9051
Makefile (3 changes)

@@ -12,6 +12,9 @@ deps:
 build:
     python3 -m pip install --verbose -e .
 
+build.debug:
+    CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Debug" python3 -m pip install --verbose --config-settings=cmake.verbose=true --config-settings=logging.level=INFO --config-settings=install.strip=false --editable .
+
 build.cuda:
     CMAKE_ARGS="-DLLAMA_CUBLAS=on" python3 -m pip install --verbose -e .
README.md (169 changes)

@@ -12,60 +12,94 @@ This package provides:
 - Low-level access to C API via `ctypes` interface.
 - High-level Python API for text completion
   - OpenAI-like API
   - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)
   - [LlamaIndex compatibility](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html)
 - OpenAI compatible web server
   - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
   - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
   - [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)
   - [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)
 
 Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).
 
 ## Installation
 
-`llama-cpp-python` can be installed directly from PyPI as a source distribution by running:
+Requirements:
+
+  - Python 3.8+
+  - C compiler
+    - Linux: gcc or clang
+    - Windows: Visual Studio or MinGW
+    - MacOS: Xcode
+
+To install the package, run:
 
 ```bash
 pip install llama-cpp-python
 ```
 
-This will build `llama.cpp` from source using cmake and your system's c compiler (required) and install the library alongside this python package.
+This will also build `llama.cpp` from source and install it alongside this python package.
 
-If you run into issues during installation add the `--verbose` flag to the `pip install` command to see the full cmake build log.
+If this fails, add `--verbose` to the `pip install` see the full cmake build log.
 
-### Installation with Specific Hardware Acceleration (BLAS, CUDA, Metal, etc)
+### Installation Configuration
 
-The default pip install behaviour is to build `llama.cpp` for CPU only on Linux and Windows and use Metal on MacOS.
+`llama.cpp` supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the [llama.cpp README](https://github.com/ggerganov/llama.cpp#build) for a full list.
 
-`llama.cpp` supports a number of hardware acceleration backends depending including OpenBLAS, cuBLAS, CLBlast, HIPBLAS, and Metal.
-See the [llama.cpp README](https://github.com/ggerganov/llama.cpp#build) for a full list of supported backends.
+All `llama.cpp` cmake build options can be set via the `CMAKE_ARGS` environment variable or via the `--config-settings / -C` cli flag during installation.
 
-All of these backends are supported by `llama-cpp-python` and can be enabled by setting the `CMAKE_ARGS` environment variable before installing.
-
-On Linux and Mac you set the `CMAKE_ARGS` like this:
+<details open>
+<summary>Environment Variables</summary>
 
 ```bash
-CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
+# Linux and Mac
+CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \
+  pip install llama-cpp-python
 ```
 
-On Windows you can set the `CMAKE_ARGS` like this:
-
-```ps
+```powershell
+# Windows
 $env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
 pip install llama-cpp-python
 ```
+</details>
 
-#### OpenBLAS
+<details>
+<summary>CLI / requirements.txt</summary>
 
-To install with OpenBLAS, set the `LLAMA_BLAS and LLAMA_BLAS_VENDOR` environment variables before installing:
+They can also be set via `pip install -C / --config-settings` command and saved to a `requirements.txt` file:
+
+```bash
+pip install --upgrade pip # ensure pip is up to date
+pip install llama-cpp-python \
+  -C cmake.args="-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS"
+```
+
+```txt
+# requirements.txt
+
+llama-cpp-python -C cmake.args="-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS"
+```
+
+</details>
+
+### Supported Backends
+
+Below are some common backends, their build commands and any additional environment variables required.
+
+<details open>
+<summary>OpenBLAS (CPU)</summary>
+
+To install with OpenBLAS, set the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` environment variables before installing:
 
 ```bash
 CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
 ```
+</details>
 
-#### cuBLAS
+<details>
+<summary>cuBLAS (CUDA)</summary>
 
 To install with cuBLAS, set the `LLAMA_CUBLAS=on` environment variable before installing:
 
@@ -73,7 +107,10 @@ To install with cuBLAS, set the `LLAMA_CUBLAS=on` environment variable before in
 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
 ```
 
-#### Metal
+</details>
+
+<details>
+<summary>Metal</summary>
 
 To install with Metal (MPS), set the `LLAMA_METAL=on` environment variable before installing:
 
@@ -81,7 +118,10 @@ To install with Metal (MPS), set the `LLAMA_METAL=on` environment variable befor
 CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
 ```
 
-#### CLBlast
+</details>
+
+<details>
+<summary>CLBlast (OpenCL)</summary>
 
 To install with CLBlast, set the `LLAMA_CLBLAST=on` environment variable before installing:
 
@@ -89,7 +129,10 @@ To install with CLBlast, set the `LLAMA_CLBLAST=on` environment variable before
 CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
 ```
 
-#### hipBLAS
+</details>
+
+<details>
+<summary>hipBLAS (ROCm)</summary>
 
 To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on` environment variable before installing:
 
@@ -97,7 +140,10 @@ To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on`
 CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
 ```
 
-#### Vulkan
+</details>
+
+<details>
+<summary>Vulkan</summary>
 
 To install with Vulkan support, set the `LLAMA_VULKAN=on` environment variable before installing:
 
@@ -105,15 +151,20 @@ To install with Vulkan support, set the `LLAMA_VULKAN=on` environment variable b
 CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
 ```
 
-#### Kompute
+</details>
+
+<details>
+<summary>Kompute</summary>
 
 To install with Kompute support, set the `LLAMA_KOMPUTE=on` environment variable before installing:
 
 ```bash
 CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
 ```
+</details>
 
-#### SYCL
+<details>
+<summary>SYCL</summary>
 
 To install with SYCL support, set the `LLAMA_SYCL=on` environment variable before installing:
 
@@ -121,9 +172,14 @@ To install with SYCL support, set the `LLAMA_SYCL=on` environment variable befor
 source /opt/intel/oneapi/setvars.sh
 CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
 ```
+</details>
 
 ### Windows Notes
 
+<details>
+<summary>Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'</summary>
+
 If you run into issues where it complains it can't find `'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:
 
 ```ps
@@ -132,12 +188,14 @@ $env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.e
 ```
 
 See the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.
+
+</details>
 
 ### MacOS Notes
 
 Detailed MacOS Metal GPU install documentation is available at [docs/install/macos.md](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/)
 
-#### M1 Mac Performance Issue
+<details>
+<summary>M1 Mac Performance Issue</summary>
 
 Note: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:
 
@@ -147,24 +205,21 @@ bash Miniforge3-MacOSX-arm64.sh
 ```
 
 Otherwise, while installing it will build the llama.cpp x86 version which will be 10x slower on Apple Silicon (M1) Mac.
+
+</details>
 
-#### M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
+<details>
+<summary>M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`</summary>
 
 Try installing with
 
 ```bash
 CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
 ```
+</details>
 
 ### Upgrading and Reinstalling
 
-To upgrade or rebuild `llama-cpp-python` add the following flags to ensure that the package is rebuilt correctly:
+To upgrade and rebuild `llama-cpp-python` add `--upgrade --force-reinstall --no-cache-dir` flags to the `pip install` command to ensure the package is rebuilt from source.
 
-```bash
-pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
-```
-
-This will ensure that all source files are re-built with the most recently set `CMAKE_ARGS` flags.
 
 ## High-level API
 
@@ -212,6 +267,21 @@ Below is a short example demonstrating how to use the high-level API to for basi
 
 Text completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
 
+### Pulling models from Hugging Face Hub
+
+You can download `Llama` models in `gguf` format directly from Hugging Face using the [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) method.
+You'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).
+
+```python
+llm = Llama.from_pretrained(
+    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
+    filename="*q8_0.gguf",
+    verbose=False
+)
+```
+
+By default [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) will download the model to the huggingface cache directory, you can then manage installed model files with the [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) tool.
+
 ### Chat Completion
 
 The high-level API also provides a simple interface for chat completion.
@@ -237,13 +307,16 @@ Note that `chat_format` option must be set for the particular model you are usin
 
 Chat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
 
+For OpenAI API v1 compatibility, you use the [`create_chat_completion_openai_v1`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion_openai_v1) method which will return pydantic models instead of dicts.
+
 ### JSON and JSON Schema Mode
 
-If you want to constrain chat responses to only valid JSON or a specific JSON Schema you can use the `response_format` argument to the `create_chat_completion` method.
+To constrain chat responses to only valid JSON or a specific JSON Schema use the `response_format` argument in [`create_chat_completion`](http://localhost:8000/api-reference/#llama_cpp.Llama.create_chat_completion).
 
 #### JSON Mode
 
-The following example will constrain the response to be valid JSON.
+The following example will constrain the response to valid JSON strings only.
 
 ```python
 >>> from llama_cpp import Llama
@@ -265,7 +338,7 @@ The following example will constrain the response to be valid JSON.
 
 #### JSON Schema Mode
 
-To constrain the response to a specific JSON Schema, you can use the `schema` property of the `response_format` argument.
+To constrain the response further to a specific JSON Schema add the schema to the `schema` property of the `response_format` argument.
 
 ```python
 >>> from llama_cpp import Llama
@@ -400,7 +473,7 @@ llama = Llama(
 
 ### Embeddings
 
-`llama-cpp-python` supports generating embeddings from the text.
+To generate text embeddings use [`create_embedding`](http://localhost:8000/api-reference/#llama_cpp.Llama.create_embedding).
 
 ```python
 import llama_cpp
@@ -409,7 +482,7 @@ llm = llama_cpp.Llama(model_path="path/to/model.gguf", embeddings=True)
 
 embeddings = llm.create_embedding("Hello, world!")
 
-# or batched
+# or create multiple embeddings at once
 embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
 ```
 
@@ -432,14 +505,14 @@ This allows you to use llama.cpp compatible models with any OpenAI compatible cl
 To install the server package and get started:
 
 ```bash
-pip install llama-cpp-python[server]
+pip install 'llama-cpp-python[server]'
 python3 -m llama_cpp.server --model models/7B/llama-model.gguf
 ```
 
 Similar to Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:
 
 ```bash
-CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
 python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
 ```
 
@@ -486,7 +559,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 ```python
 >>> import llama_cpp
 >>> import ctypes
->>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
+>>> llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
 >>> params = llama_cpp.llama_context_default_params()
 # use bytes for char * params
 >>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
@@ -494,7 +567,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 >>> max_tokens = params.n_ctx
 # use ctypes arrays for array params
 >>> tokens = (llama_cpp.llama_token * int(max_tokens))()
->>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
+>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
 >>> llama_cpp.llama_free(ctx)
 ```
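The updated README points callers at `create_chat_completion_openai_v1` when they want OpenAI-v1-typed results instead of plain dicts. A minimal sketch of such a call, assuming a chat-capable GGUF model on disk (the model path is hypothetical) and the `openai` package installed:

```python
from llama_cpp import Llama

# Hypothetical local model path; any chat-capable GGUF model would do.
llm = Llama(model_path="./models/7B/llama-model.gguf", chat_format="llama-2")

# Returns an openai.types.chat.ChatCompletion pydantic model rather than a dict.
completion = llm.create_chat_completion_openai_v1(
    messages=[{"role": "user", "content": "Name the planets in the solar system."}]
)
print(completion.choices[0].message.content)
```

With `stream=True` the same method yields `ChatCompletionChunk` objects instead, per the implementation further down in this compare.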
(next changed file)

@@ -21,11 +21,13 @@ High-level Python bindings for llama.cpp.
 - create_completion
 - __call__
 - create_chat_completion
+- create_chat_completion_openai_v1
 - set_cache
 - save_state
 - load_state
 - token_bos
 - token_eos
+- from_pretrained
 show_root_heading: true
 
 ::: llama_cpp.LlamaGrammar
(next changed file)

@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *
 
-__version__ = "0.2.44"
+__version__ = "0.2.48"
(next changed file)

@@ -51,6 +51,9 @@ class _LlamaModel:
             self.path_model.encode("utf-8"), self.params
         )
 
+        if self.model is None:
+            raise ValueError(f"Failed to load model from file: {path_model}")
+
     def __del__(self):
         if self.model is not None and self._llama_free_model is not None:
             self._llama_free_model(self.model)
@@ -79,7 +82,7 @@ class _LlamaModel:
     def desc(self) -> str:
         assert self.model is not None
         buf = ctypes.create_string_buffer(1024)
-        llama_cpp.llama_model_desc(self.model, buf, 1024)  # type: ignore
+        llama_cpp.llama_model_desc(self.model, buf, 1024)
         return buf.value.decode("utf-8")
 
     def size(self) -> int:
@@ -108,7 +111,7 @@ class _LlamaModel:
             scale,
             path_base_model.encode("utf-8")
             if path_base_model is not None
-            else llama_cpp.c_char_p(0),
+            else ctypes.c_char_p(0),
             n_threads,
         )
 
@@ -181,7 +184,7 @@ class _LlamaModel:
     def token_to_piece(self, token: int) -> bytes:
         assert self.model is not None
         buf = ctypes.create_string_buffer(32)
-        llama_cpp.llama_token_to_piece(self.model, token, buf, 32)  # type: ignore
+        llama_cpp.llama_token_to_piece(self.model, token, buf, 32)
         return bytes(buf)
 
     def detokenize(self, tokens: List[int]) -> bytes:
@@ -258,6 +261,9 @@ class _LlamaContext:
             self.model.model, self.params
         )
 
+        if self.ctx is None:
+            raise ValueError("Failed to create llama_context")
+
     def __del__(self):
         if self.ctx is not None and self._llama_free is not None:
             self._llama_free(self.ctx)
@@ -303,8 +309,8 @@ class _LlamaContext:
         assert self.ctx is not None
         assert batch.batch is not None
         return_code = llama_cpp.llama_decode(
-            ctx=self.ctx,
-            batch=batch.batch,
+            self.ctx,
+            batch.batch,
         )
         if return_code != 0:
             raise RuntimeError(f"llama_decode returned {return_code}")
@@ -343,7 +349,7 @@ class _LlamaContext:
         assert self.ctx is not None
         llama_cpp.llama_sample_repetition_penalties(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
             last_tokens_data,
             penalty_last_n,
             penalty_repeat,
@@ -361,7 +367,7 @@ class _LlamaContext:
         assert guidance_ctx.ctx is not None
         llama_cpp.llama_sample_classifier_free_guidance(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
             guidance_ctx.ctx,
             scale,
         )
@@ -370,25 +376,25 @@ class _LlamaContext:
         assert self.ctx is not None
         llama_cpp.llama_sample_softmax(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
         )
 
     def sample_top_k(self, candidates: "_LlamaTokenDataArray", k: int, min_keep: int):
         assert self.ctx is not None
         llama_cpp.llama_sample_top_k(
-            self.ctx, ctypes.byref(candidates.candidates), k, min_keep  # type: ignore
+            self.ctx, llama_cpp.byref(candidates.candidates), k, min_keep
         )
 
     def sample_top_p(self, candidates: "_LlamaTokenDataArray", p: float, min_keep: int):
         assert self.ctx is not None
         llama_cpp.llama_sample_top_p(
-            self.ctx, ctypes.byref(candidates.candidates), p, min_keep  # type: ignore
+            self.ctx, llama_cpp.byref(candidates.candidates), p, min_keep
         )
 
     def sample_min_p(self, candidates: "_LlamaTokenDataArray", p: float, min_keep: int):
         assert self.ctx is not None
         llama_cpp.llama_sample_min_p(
-            self.ctx, ctypes.byref(candidates.candidates), p, min_keep  # type: ignore
+            self.ctx, llama_cpp.byref(candidates.candidates), p, min_keep
         )
 
     def sample_tail_free(
@@ -396,7 +402,7 @@ class _LlamaContext:
     ):
         assert self.ctx is not None
         llama_cpp.llama_sample_tail_free(
-            self.ctx, ctypes.byref(candidates.candidates), z, min_keep  # type: ignore
+            self.ctx, llama_cpp.byref(candidates.candidates), z, min_keep
         )
 
     def sample_typical(
@@ -404,13 +410,13 @@ class _LlamaContext:
     ):
         assert self.ctx is not None
         llama_cpp.llama_sample_typical(
-            self.ctx, ctypes.byref(candidates.candidates), p, min_keep  # type: ignore
+            self.ctx, llama_cpp.byref(candidates.candidates), p, min_keep
        )
 
     def sample_temp(self, candidates: "_LlamaTokenDataArray", temp: float):
         assert self.ctx is not None
         llama_cpp.llama_sample_temp(
-            self.ctx, ctypes.byref(candidates.candidates), temp  # type: ignore
+            self.ctx, llama_cpp.byref(candidates.candidates), temp
         )
 
     def sample_grammar(self, candidates: "_LlamaTokenDataArray", grammar: LlamaGrammar):
@@ -418,7 +424,7 @@ class _LlamaContext:
         assert grammar.grammar is not None
         llama_cpp.llama_sample_grammar(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
             grammar.grammar,
         )
 
@@ -428,12 +434,12 @@ class _LlamaContext:
         tau: float,
         eta: float,
         m: int,
-        mu: ctypes._Pointer[ctypes.c_float],  # type: ignore
+        mu: llama_cpp.CtypesPointerOrRef[ctypes.c_float],
     ) -> int:
         assert self.ctx is not None
         return llama_cpp.llama_sample_token_mirostat(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
             tau,
             eta,
             m,
@@ -441,12 +447,12 @@ class _LlamaContext:
         )
 
     def sample_token_mirostat_v2(
-        self, candidates: "_LlamaTokenDataArray", tau: float, eta: float, mu: ctypes._Pointer[ctypes.c_float]  # type: ignore
+        self, candidates: "_LlamaTokenDataArray", tau: float, eta: float, mu: llama_cpp.CtypesPointerOrRef[ctypes.c_float]
     ) -> int:
         assert self.ctx is not None
         return llama_cpp.llama_sample_token_mirostat_v2(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
             tau,
             eta,
             mu,
@@ -456,14 +462,14 @@ class _LlamaContext:
         assert self.ctx is not None
         return llama_cpp.llama_sample_token_greedy(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
         )
 
     def sample_token(self, candidates: "_LlamaTokenDataArray") -> int:
         assert self.ctx is not None
         return llama_cpp.llama_sample_token(
             self.ctx,
-            ctypes.byref(candidates.candidates),  # type: ignore
+            llama_cpp.byref(candidates.candidates),
         )
 
     # Grammar
@@ -493,7 +499,7 @@ class _LlamaBatch:
     def __init__(
         self, *, n_tokens: int, embd: int, n_seq_max: int, verbose: bool = True
     ):
-        self.n_tokens = n_tokens
+        self._n_tokens = n_tokens
         self.embd = embd
         self.n_seq_max = n_seq_max
         self.verbose = verbose
@@ -502,7 +508,7 @@ class _LlamaBatch:
 
         self.batch = None
         self.batch = llama_cpp.llama_batch_init(
-            self.n_tokens, self.embd, self.n_seq_max
+            self._n_tokens, self.embd, self.n_seq_max
         )
 
     def __del__(self):
@@ -560,7 +566,7 @@ class _LlamaTokenDataArray:
             size=self.n_vocab,
             sorted=False,
         )
-        self.default_candidates_data_id = np.arange(self.n_vocab, dtype=np.intc)
+        self.default_candidates_data_id = np.arange(self.n_vocab, dtype=np.intc)  # type: ignore
         self.default_candidates_data_p = np.zeros(self.n_vocab, dtype=np.single)
 
     def copy_logits(self, logits: npt.NDArray[np.single]):
@@ -570,12 +576,13 @@ class _LlamaTokenDataArray:
         self.candidates.data = self.candidates_data.ctypes.data_as(
             llama_cpp.llama_token_data_p
         )
-        self.candidates.sorted = llama_cpp.c_bool(False)
-        self.candidates.size = llama_cpp.c_size_t(self.n_vocab)
+        self.candidates.sorted = ctypes.c_bool(False)
+        self.candidates.size = ctypes.c_size_t(self.n_vocab)
 
 
 # Python wrappers over common/common
 def _tokenize(model: _LlamaModel, text: str, add_bos: bool, special: bool) -> list[int]:
+    assert model.model is not None
     n_tokens = len(text) + 1 if add_bos else len(text)
     result = (llama_cpp.llama_token * n_tokens)()
     n_tokens = llama_cpp.llama_tokenize(
@@ -747,7 +754,7 @@ class _LlamaSamplingContext:
             ctx_main.sample_repetition_penalties(
                 token_data_array,
                 # TODO: Only create this once
-                (llama_cpp.llama_token * len(self.prev))(*self.prev),  # type: ignore
+                (llama_cpp.llama_token * len(self.prev))(*self.prev),
                 self.params.penalty_last_n,
                 self.params.penalty_repeat,
                 self.params.penalty_freq,
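A failed model or context load now raises `ValueError` (per the changelog entry above) instead of leaving a null handle around. A hedged sketch of how a caller might guard construction at the high-level API, assuming the error propagates up through `Llama` unchanged:

```python
from llama_cpp import Llama

try:
    # Hypothetical path that does not exist, to trigger the failure branch.
    llm = Llama(model_path="./models/does-not-exist.gguf")
except ValueError as e:
    # e.g. "Failed to load model from file: ./models/does-not-exist.gguf"
    print(f"Model load failed: {e}")
```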
(next changed file)

@@ -4,6 +4,8 @@ import os
 import sys
 import uuid
 import time
+import json
+import fnmatch
 import multiprocessing
 from typing import (
     List,
@@ -16,6 +18,7 @@ from typing import (
     Callable,
 )
 from collections import deque
+from pathlib import Path
 
 import ctypes
 
@@ -29,10 +32,7 @@ from .llama_cache import (
     LlamaDiskCache,  # type: ignore
     LlamaRAMCache,  # type: ignore
 )
-from .llama_tokenizer import (
-    BaseLlamaTokenizer,
-    LlamaTokenizer
-)
+from .llama_tokenizer import BaseLlamaTokenizer, LlamaTokenizer
 import llama_cpp.llama_cpp as llama_cpp
 import llama_cpp.llama_chat_format as llama_chat_format
 
@@ -50,9 +50,7 @@ from ._internals import (
     _LlamaSamplingContext,  # type: ignore
 )
 from ._logger import set_verbose
-from ._utils import (
-    suppress_stdout_stderr
-)
+from ._utils import suppress_stdout_stderr
 
 
 class Llama:
@@ -189,7 +187,11 @@ class Llama:
         Llama.__backend_initialized = True
 
         if isinstance(numa, bool):
-            self.numa = llama_cpp.GGML_NUMA_STRATEGY_DISTRIBUTE if numa else llama_cpp.GGML_NUMA_STRATEGY_DISABLED
+            self.numa = (
+                llama_cpp.GGML_NUMA_STRATEGY_DISTRIBUTE
+                if numa
+                else llama_cpp.GGML_NUMA_STRATEGY_DISABLED
+            )
         else:
             self.numa = numa
 
@@ -246,9 +248,9 @@ class Llama:
             else:
                 raise ValueError(f"Unknown value type for {k}: {v}")
 
-            self._kv_overrides_array[
-                -1
-            ].key = b"\0"  # ensure sentinel element is zeroed
+            self._kv_overrides_array[-1].key = (
+                b"\0"  # ensure sentinel element is zeroed
+            )
             self.model_params.kv_overrides = self._kv_overrides_array
 
         self.n_batch = min(n_ctx, n_batch)  # ???
@@ -256,7 +258,7 @@ class Llama:
         self.n_threads_batch = n_threads_batch or max(
             multiprocessing.cpu_count() // 2, 1
         )
 
         # Context Params
         self.context_params = llama_cpp.llama_context_default_params()
         self.context_params.seed = seed
@@ -289,7 +291,9 @@ class Llama:
         )
         self.context_params.yarn_orig_ctx = yarn_orig_ctx if yarn_orig_ctx != 0 else 0
         self.context_params.mul_mat_q = mul_mat_q
-        self.context_params.logits_all = logits_all if draft_model is None else True  # Must be set to True for speculative decoding
+        self.context_params.logits_all = (
+            logits_all if draft_model is None else True
+        )  # Must be set to True for speculative decoding
         self.context_params.embedding = embedding
         self.context_params.offload_kqv = offload_kqv
 
@@ -379,8 +383,14 @@ class Llama:
         if self.verbose:
             print(f"Model metadata: {self.metadata}", file=sys.stderr)
 
-        if self.chat_format is None and self.chat_handler is None and "tokenizer.chat_template" in self.metadata:
-            chat_format = llama_chat_format.guess_chat_format_from_gguf_metadata(self.metadata)
+        if (
+            self.chat_format is None
+            and self.chat_handler is None
+            and "tokenizer.chat_template" in self.metadata
+        ):
+            chat_format = llama_chat_format.guess_chat_format_from_gguf_metadata(
+                self.metadata
+            )
 
             if chat_format is not None:
                 self.chat_format = chat_format
@@ -406,9 +416,7 @@ class Llama:
                 print(f"Using chat bos_token: {bos_token}", file=sys.stderr)
 
             self.chat_handler = llama_chat_format.Jinja2ChatFormatter(
-                template=template,
-                eos_token=eos_token,
-                bos_token=bos_token
+                template=template, eos_token=eos_token, bos_token=bos_token
             ).to_chat_handler()
 
         if self.chat_format is None and self.chat_handler is None:
@@ -459,7 +467,9 @@ class Llama:
         """
         return self.tokenizer_.tokenize(text, add_bos, special)
 
-    def detokenize(self, tokens: List[int], prev_tokens: Optional[List[int]] = None) -> bytes:
+    def detokenize(
+        self, tokens: List[int], prev_tokens: Optional[List[int]] = None
+    ) -> bytes:
         """Detokenize a list of tokens.
 
         Args:
@@ -565,7 +575,7 @@ class Llama:
             logits[:] = (
                 logits_processor(self._input_ids, logits)
                 if idx is None
-                else logits_processor(self._input_ids[:idx + 1], logits)
+                else logits_processor(self._input_ids[: idx + 1], logits)
             )
 
         sampling_params = _LlamaSamplingParams(
@@ -707,7 +717,9 @@ class Llama:
 
         if self.draft_model is not None:
             self.input_ids[self.n_tokens : self.n_tokens + len(tokens)] = tokens
-            draft_tokens = self.draft_model(self.input_ids[:self.n_tokens + len(tokens)])
+            draft_tokens = self.draft_model(
+                self.input_ids[: self.n_tokens + len(tokens)]
+            )
             tokens.extend(
                 draft_tokens.astype(int)[
                     : self._n_ctx - self.n_tokens - len(tokens)
@@ -792,6 +804,7 @@ class Llama:
 
         # decode and fetch embeddings
         data: List[List[float]] = []
+
         def decode_batch(n_seq: int):
             assert self._ctx.ctx is not None
             llama_cpp.llama_kv_cache_clear(self._ctx.ctx)
@@ -800,9 +813,9 @@ class Llama:
 
             # store embeddings
             for i in range(n_seq):
-                embedding: List[float] = llama_cpp.llama_get_embeddings_ith(self._ctx.ctx, i)[
-                    :n_embd
-                ]
+                embedding: List[float] = llama_cpp.llama_get_embeddings_ith(
+                    self._ctx.ctx, i
+                )[:n_embd]
                 if normalize:
                     norm = float(np.linalg.norm(embedding))
                     embedding = [v / norm for v in embedding]
@@ -1669,12 +1682,13 @@ class Llama:
         """
         try:
             from openai.types.chat import ChatCompletion, ChatCompletionChunk
-            stream = kwargs.get("stream", False)  # type: ignore
+
+            stream = kwargs.get("stream", False)  # type: ignore
             assert isinstance(stream, bool)
             if stream:
                 return (ChatCompletionChunk(**chunk) for chunk in self.create_chat_completion(*args, **kwargs))  # type: ignore
             else:
                 return ChatCompletion(**self.create_chat_completion(*args, **kwargs))  # type: ignore
         except ImportError:
             raise ImportError(
                 "To use create_chat_completion_openai_v1, you must install the openai package."
@@ -1804,7 +1818,7 @@ class Llama:
         self.input_ids = state.input_ids.copy()
         self.n_tokens = state.n_tokens
         state_size = state.llama_state_size
-        LLamaStateArrayType = llama_cpp.c_uint8 * state_size
+        LLamaStateArrayType = ctypes.c_uint8 * state_size
         llama_state = LLamaStateArrayType.from_buffer_copy(state.llama_state)
 
         if llama_cpp.llama_set_state_data(self._ctx.ctx, llama_state) != state_size:
@@ -1866,7 +1880,100 @@ class Llama:
                 break
         return longest_prefix
 
+    @classmethod
+    def from_pretrained(
+        cls,
+        repo_id: str,
+        filename: Optional[str],
+        local_dir: Optional[Union[str, os.PathLike[str]]] = None,
+        local_dir_use_symlinks: Union[bool, Literal["auto"]] = "auto",
+        cache_dir: Optional[Union[str, os.PathLike[str]]] = None,
+        **kwargs: Any,
+    ) -> "Llama":
+        """Create a Llama model from a pretrained model name or path.
+        This method requires the huggingface-hub package.
+        You can install it with `pip install huggingface-hub`.
+
+        Args:
+            repo_id: The model repo id.
+            filename: A filename or glob pattern to match the model file in the repo.
+            local_dir: The local directory to save the model to.
+            local_dir_use_symlinks: Whether to use symlinks when downloading the model.
+            **kwargs: Additional keyword arguments to pass to the Llama constructor.
+
+        Returns:
+            A Llama model."""
+        try:
+            from huggingface_hub import hf_hub_download, HfFileSystem
+            from huggingface_hub.utils import validate_repo_id
+        except ImportError:
+            raise ImportError(
+                "Llama.from_pretrained requires the huggingface-hub package. "
+                "You can install it with `pip install huggingface-hub`."
+            )
+
+        validate_repo_id(repo_id)
+
+        hffs = HfFileSystem()
+
+        files = [
+            file["name"] if isinstance(file, dict) else file
+            for file in hffs.ls(repo_id)
+        ]
+
+        # split each file into repo_id, subfolder, filename
+        file_list: List[str] = []
+        for file in files:
+            rel_path = Path(file).relative_to(repo_id)
+            file_list.append(str(rel_path))
+
+        matching_files = [file for file in file_list if fnmatch.fnmatch(file, filename)]  # type: ignore
+
+        if len(matching_files) == 0:
+            raise ValueError(
+                f"No file found in {repo_id} that match {filename}\n\n"
+                f"Available Files:\n{json.dumps(file_list)}"
+            )
+
+        if len(matching_files) > 1:
+            raise ValueError(
+                f"Multiple files found in {repo_id} matching {filename}\n\n"
+                f"Available Files:\n{json.dumps(files)}"
+            )
+
+        (matching_file,) = matching_files
+
+        subfolder = str(Path(matching_file).parent)
+        filename = Path(matching_file).name
+
+        # download the file
+        hf_hub_download(
+            repo_id=repo_id,
+            filename=filename,
+            subfolder=subfolder,
+            local_dir=local_dir,
+            local_dir_use_symlinks=local_dir_use_symlinks,
+            cache_dir=cache_dir,
+        )
+
+        if local_dir is None:
+            model_path = hf_hub_download(
+                repo_id=repo_id,
+                filename=filename,
+                subfolder=subfolder,
+                local_dir=local_dir,
+                local_dir_use_symlinks=local_dir_use_symlinks,
+                cache_dir=cache_dir,
+                local_files_only=True,
+            )
+        else:
+            model_path = os.path.join(local_dir, filename)
+
+        return cls(
+            model_path=model_path,
+            **kwargs,
+        )
+
 
 class LlamaState:
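In `from_pretrained` above, the `filename` argument is treated as a glob over the flattened repository listing, and exactly one file must match. A minimal standalone sketch of that matching step, with file names that are made up for illustration:

```python
import fnmatch

# Hypothetical repo listing, already flattened relative to the repo root.
file_list = [
    "README.md",
    "qwen1_5-0_5b-chat-q2_k.gguf",
    "qwen1_5-0_5b-chat-q8_0.gguf",
]

pattern = "*q8_0.gguf"
matching_files = [f for f in file_list if fnmatch.fnmatch(f, pattern)]
assert matching_files == ["qwen1_5-0_5b-chat-q8_0.gguf"]  # exactly one match is required
```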
(next changed file)

@@ -14,6 +14,7 @@ import llama_cpp.llama as llama
 import llama_cpp.llama_types as llama_types
 import llama_cpp.llama_grammar as llama_grammar
 
+from ._logger import logger
 from ._utils import suppress_stdout_stderr, Singleton
 
 ### Common Chat Templates and Special Tokens ###
@@ -993,6 +994,26 @@ def format_saiga(
     return ChatFormatterResponse(prompt=_prompt.strip())
 
 
+# Chat format for Google's Gemma models, see more details and available models:
+# https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b
+@register_chat_format("gemma")
+def format_gemma(
+    messages: List[llama_types.ChatCompletionRequestMessage],
+    **kwargs: Any,
+) -> ChatFormatterResponse:
+    system_message = _get_system_message(messages)
+    if system_message is not None and system_message != "":
+        logger.debug(
+            "`role='system'` messages are not allowed on Google's Gemma models."
+        )
+    _roles = dict(user="<start_of_turn>user\n", assistant="<start_of_turn>model\n")
+    _sep = "<end_of_turn>\n"
+    _messages = _map_roles(messages, _roles)
+    _messages.append((_roles["assistant"], None))
+    _prompt = _format_no_colon_single(system_message="", messages=_messages, sep=_sep)
+    return ChatFormatterResponse(prompt=_prompt, stop=_sep)
+
+
 # Tricky chat formats that require custom chat handlers
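As a rough illustration of the prompt shape the new `gemma` formatter builds from the roles and separator defined above (a hand-assembled sketch; exact whitespace is not guaranteed here):

```python
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
    {"role": "user", "content": "What is 2+2?"},
]

# Each turn is "<start_of_turn>{role}\n{content}<end_of_turn>\n"; the final
# assistant role is appended without content so the model continues from it.
expected_prompt = (
    "<start_of_turn>user\nHello!<end_of_turn>\n"
    "<start_of_turn>model\nHi there.<end_of_turn>\n"
    "<start_of_turn>user\nWhat is 2+2?<end_of_turn>\n"
    "<start_of_turn>model\n"  # generation stops at the "<end_of_turn>\n" stop string
)
```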
(One file's diff is suppressed in the compare view because it is too large.)
@@ -1498,9 +1498,21 @@ class SchemaConverter:
             item_rule_name = self.visit(
                 schema["items"], f'{name}{"-" if name else ""}item'
             )
-            rule = (
-                f'"[" space ({item_rule_name} ("," space {item_rule_name})*)? "]" space'
-            )
+            list_item_operator = f'("," space {item_rule_name})'
+            successive_items = ""
+            min_items = schema.get("minItems", 0)
+            if min_items > 0:
+                first_item = f"({item_rule_name})"
+                successive_items = list_item_operator * (min_items - 1)
+                min_items -= 1
+            else:
+                first_item = f"({item_rule_name})?"
+            max_items = schema.get("maxItems")
+            if max_items is not None and max_items > min_items:
+                successive_items += (list_item_operator + "?") * (max_items - min_items - 1)
+            else:
+                successive_items += list_item_operator + "*"
+            rule = f'"[" space {first_item} {successive_items} "]" space'
             return self._add_rule(rule_name, rule)
 
         else:
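To make the `minItems`/`maxItems` handling concrete, here is a standalone re-statement of the rule construction above; the `array_rule` helper and the `integer` item-rule name are illustrative only, not part of the library's public API.

```python
from typing import Optional

def array_rule(item_rule_name: str, min_items: int = 0, max_items: Optional[int] = None) -> str:
    """Build a GBNF array rule the same way the converter hunk above does."""
    list_item_operator = f'("," space {item_rule_name})'
    successive_items = ""
    if min_items > 0:
        first_item = f"({item_rule_name})"      # first element is required
        successive_items = list_item_operator * (min_items - 1)
        min_items -= 1
    else:
        first_item = f"({item_rule_name})?"     # array may be empty
    if max_items is not None and max_items > min_items:
        # each element beyond the minimum is individually optional
        successive_items += (list_item_operator + "?") * (max_items - min_items - 1)
    else:
        successive_items += list_item_operator + "*"
    return f'"[" space {first_item} {successive_items} "]" space'

# minItems=2, maxItems=3: two required elements, at most one more.
print(array_rule("integer", 2, 3))
# "[" space (integer) ("," space integer)("," space integer)? "]" space
```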
@@ -5,21 +5,15 @@ from ctypes import (
     c_bool,
     c_char_p,
     c_int,
-    c_int8,
-    c_int32,
     c_uint8,
-    c_uint32,
-    c_size_t,
     c_float,
-    c_double,
     c_void_p,
     POINTER,
     _Pointer,  # type: ignore
     Structure,
-    Array,
 )
 import pathlib
-from typing import List, Union
+from typing import List, Union, NewType, Optional
 
 import llama_cpp.llama_cpp as llama_cpp
 
@@ -67,7 +61,7 @@ def _load_shared_library(lib_base_name: str):
     for _lib_path in _lib_paths:
         if _lib_path.exists():
             try:
-                return ctypes.CDLL(str(_lib_path), **cdll_args)
+                return ctypes.CDLL(str(_lib_path), **cdll_args)  # type: ignore
             except Exception as e:
                 raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
 
@@ -88,7 +82,8 @@ _libllava = _load_shared_library(_libllava_base_name)
 ################################################
 
 # struct clip_ctx;
-clip_ctx_p = c_void_p
+clip_ctx_p = NewType("clip_ctx_p", int)
+clip_ctx_p_ctypes = c_void_p
 
 # struct llava_image_embed {
 #     float * embed;
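The `NewType` / `c_void_p` split introduced here is a common pattern for typed opaque handles: the `NewType` gives static checkers a distinct handle type, while the `_ctypes` alias is what actually gets registered in `argtypes`/`restype`. A minimal sketch with made-up names (nothing below is the library's API):

```python
import ctypes
from typing import NewType, Optional

widget_ctx_p = NewType("widget_ctx_p", int)   # static-analysis-only handle type
widget_ctx_p_ctypes = ctypes.c_void_p         # runtime ctypes representation

def wrap_handle(raw: Optional[int]) -> Optional[widget_ctx_p]:
    # ctypes returns None for NULL pointers; keep that as Optional.
    return widget_ctx_p(raw) if raw else None

print(wrap_handle(0x1234), wrap_handle(None))  # 4660 None
```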
@@ -102,43 +97,48 @@ class llava_image_embed(Structure):
 
 # /** sanity check for clip <-> llava embed size match */
 # LLAVA_API bool llava_validate_embed_size(const llama_context * ctx_llama, const clip_ctx * ctx_clip);
-def llava_validate_embed_size(ctx_llama: llama_cpp.llama_context_p, ctx_clip: clip_ctx_p) -> bool:
-    return _libllava.llava_validate_embed_size(ctx_llama, ctx_clip)
+def llava_validate_embed_size(ctx_llama: llama_cpp.llama_context_p, ctx_clip: clip_ctx_p, /) -> bool:
+    ...
 
-_libllava.llava_validate_embed_size.argtypes = [llama_cpp.llama_context_p, clip_ctx_p]
-_libllava.llava_validate_embed_size.restype = c_bool
+llava_validate_embed_size = _libllava.llava_validate_embed_size
+llava_validate_embed_size.argtypes = [llama_cpp.llama_context_p_ctypes, clip_ctx_p_ctypes]
+llava_validate_embed_size.restype = c_bool
 
 # /** build an image embed from image file bytes */
 # LLAVA_API struct llava_image_embed * llava_image_embed_make_with_bytes(struct clip_ctx * ctx_clip, int n_threads, const unsigned char * image_bytes, int image_bytes_length);
-def llava_image_embed_make_with_bytes(ctx_clip: clip_ctx_p, n_threads: Union[c_int, int], image_bytes: bytes, image_bytes_length: Union[c_int, int]) -> "_Pointer[llava_image_embed]":
-    return _libllava.llava_image_embed_make_with_bytes(ctx_clip, n_threads, image_bytes, image_bytes_length)
+def llava_image_embed_make_with_bytes(ctx_clip: clip_ctx_p, n_threads: Union[c_int, int], image_bytes: bytes, image_bytes_length: Union[c_int, int], /) -> "_Pointer[llava_image_embed]":
+    ...
 
-_libllava.llava_image_embed_make_with_bytes.argtypes = [clip_ctx_p, c_int, POINTER(c_uint8), c_int]
-_libllava.llava_image_embed_make_with_bytes.restype = POINTER(llava_image_embed)
+llava_image_embed_make_with_bytes = _libllava.llava_image_embed_make_with_bytes
+llava_image_embed_make_with_bytes.argtypes = [clip_ctx_p_ctypes, c_int, POINTER(c_uint8), c_int]
+llava_image_embed_make_with_bytes.restype = POINTER(llava_image_embed)
 
 # /** build an image embed from a path to an image filename */
 # LLAVA_API struct llava_image_embed * llava_image_embed_make_with_filename(struct clip_ctx * ctx_clip, int n_threads, const char * image_path);
-def llava_image_embed_make_with_filename(ctx_clip: clip_ctx_p, n_threads: Union[c_int, int], image_path: bytes) -> "_Pointer[llava_image_embed]":
-    return _libllava.llava_image_embed_make_with_filename(ctx_clip, n_threads, image_path)
+def llava_image_embed_make_with_filename(ctx_clip: clip_ctx_p, n_threads: Union[c_int, int], image_path: bytes, /) -> "_Pointer[llava_image_embed]":
+    ...
 
-_libllava.llava_image_embed_make_with_filename.argtypes = [clip_ctx_p, c_int, c_char_p]
-_libllava.llava_image_embed_make_with_filename.restype = POINTER(llava_image_embed)
+llava_image_embed_make_with_filename = _libllava.llava_image_embed_make_with_filename
+llava_image_embed_make_with_filename.argtypes = [clip_ctx_p_ctypes, c_int, c_char_p]
+llava_image_embed_make_with_filename.restype = POINTER(llava_image_embed)
 
 # LLAVA_API void llava_image_embed_free(struct llava_image_embed * embed);
 # /** free an embedding made with llava_image_embed_make_* */
-def llava_image_embed_free(embed: "_Pointer[llava_image_embed]"):
-    return _libllava.llava_image_embed_free(embed)
+def llava_image_embed_free(embed: "_Pointer[llava_image_embed]", /):
+    ...
 
-_libllava.llava_image_embed_free.argtypes = [POINTER(llava_image_embed)]
-_libllava.llava_image_embed_free.restype = None
+llava_image_embed_free = _libllava.llava_image_embed_free
+llava_image_embed_free.argtypes = [POINTER(llava_image_embed)]
+llava_image_embed_free.restype = None
 
 # /** write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. on completion, n_past points to the next position in the context after the image embed. */
 # LLAVA_API bool llava_eval_image_embed(struct llama_context * ctx_llama, const struct llava_image_embed * embed, int n_batch, int * n_past);
-def llava_eval_image_embed(ctx_llama: llama_cpp.llama_context_p, embed: "_Pointer[llava_image_embed]", n_batch: Union[c_int, int], n_past: "_Pointer[c_int]") -> bool:
-    return _libllava.llava_eval_image_embed(ctx_llama, embed, n_batch, n_past)
+def llava_eval_image_embed(ctx_llama: llama_cpp.llama_context_p, embed: "_Pointer[llava_image_embed]", n_batch: Union[c_int, int], n_past: "_Pointer[c_int]", /) -> bool:
+    ...
 
-_libllava.llava_eval_image_embed.argtypes = [llama_cpp.llama_context_p, POINTER(llava_image_embed), c_int, POINTER(c_int)]
-_libllava.llava_eval_image_embed.restype = c_bool
+llava_eval_image_embed = _libllava.llava_eval_image_embed
+llava_eval_image_embed.argtypes = [llama_cpp.llama_context_p_ctypes, POINTER(llava_image_embed), c_int, POINTER(c_int)]
+llava_eval_image_embed.restype = c_bool
 
 
 ################################################
@@ -148,16 +148,18 @@ _libllava.llava_eval_image_embed.restype = c_bool
 
 # /** load mmproj model */
 # CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity);
-def clip_model_load(fname: bytes, verbosity: Union[c_int, int]) -> clip_ctx_p:
-    return _libllava.clip_model_load(fname, verbosity)
+def clip_model_load(fname: bytes, verbosity: Union[c_int, int], /) -> Optional[clip_ctx_p]:
+    ...
 
-_libllava.clip_model_load.argtypes = [c_char_p, c_int]
-_libllava.clip_model_load.restype = clip_ctx_p
+clip_model_load = _libllava.clip_model_load
+clip_model_load.argtypes = [c_char_p, c_int]
+clip_model_load.restype = clip_ctx_p_ctypes
 
 # /** free mmproj model */
 # CLIP_API void clip_free(struct clip_ctx * ctx);
-def clip_free(ctx: clip_ctx_p):
-    return _libllava.clip_free(ctx)
+def clip_free(ctx: clip_ctx_p, /):
+    ...
 
-_libllava.clip_free.argtypes = [clip_ctx_p]
-_libllava.clip_free.restype = None
+clip_free = _libllava.clip_free
+clip_free.argtypes = [clip_ctx_p_ctypes]
+clip_free.restype = None
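The stub-plus-rebind pattern used throughout this file (a typed, positional-only `def ...: ...` for editors and type checkers, then rebinding the name to the raw ctypes function and declaring `argtypes`/`restype`) can be sketched against a symbol that is always available in CPython; the names and the choice of `PySys_GetObject` below are purely illustrative, not anything libllava exposes.

```python
import ctypes

# Typed stub: documents the positional-only signature for static analysis.
def py_sys_get_object(name: bytes, /) -> int:
    ...

# Rebind to the actual ctypes function and declare its C signature.
py_sys_get_object = ctypes.pythonapi.PySys_GetObject
py_sys_get_object.argtypes = [ctypes.c_char_p]
py_sys_get_object.restype = ctypes.c_void_p

# Non-NULL because sys.path always exists (borrowed reference, not used further).
print(bool(py_sys_get_object(b"path")))  # True
```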
@@ -72,4 +72,4 @@ Documentation = "https://llama-cpp-python.readthedocs.io/en/latest/"
 Changelog = "https://llama-cpp-python.readthedocs.io/en/latest/changelog/"
 
 [tool.pytest.ini_options]
-addopts = "--ignore=vendor"
+testpaths = "tests"
@@ -54,7 +54,7 @@ def mock_llama(monkeypatch):
         output_tokens = llama.tokenize(
             output_text.encode("utf-8"), add_bos=True, special=True
         )
-        logits = (llama_cpp.c_float * (n_vocab * n_ctx))(-100.0)
+        logits = (ctypes.c_float * (n_vocab * n_ctx))(-100.0)
         for i in range(n_ctx):
             output_idx = i + 1  # logits for first tokens predict second token
             if output_idx < len(output_tokens):
@@ -90,9 +90,9 @@ def mock_llama(monkeypatch):
         assert n > 0, "mock_llama_decode not called"
         assert last_n_tokens > 0, "mock_llama_decode not called"
         # Return view of logits for last_n_tokens
-        return (llama_cpp.c_float * (last_n_tokens * n_vocab)).from_address(
+        return (ctypes.c_float * (last_n_tokens * n_vocab)).from_address(
             ctypes.addressof(logits)
-            + (n - last_n_tokens) * n_vocab * ctypes.sizeof(llama_cpp.c_float)
+            + (n - last_n_tokens) * n_vocab * ctypes.sizeof(ctypes.c_float)
         )
 
     monkeypatch.setattr("llama_cpp.llama_cpp.llama_decode", mock_decode)
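The mock above relies on `from_address` to return a window into the tail of an existing ctypes buffer rather than a copy; a tiny self-contained sketch with made-up sizes:

```python
import ctypes

# 3 "rows" of 4 logits each, filled with 0..11.
n_vocab, n_ctx, last_n_tokens = 4, 3, 2
logits = (ctypes.c_float * (n_vocab * n_ctx))(*range(n_vocab * n_ctx))

# View of the last two rows: same memory, no copy.
view = (ctypes.c_float * (last_n_tokens * n_vocab)).from_address(
    ctypes.addressof(logits)
    + (n_ctx - last_n_tokens) * n_vocab * ctypes.sizeof(ctypes.c_float)
)
print(list(view))  # [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
```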
2 vendor/llama.cpp vendored
@@ -1 +1 @@
-Subproject commit f53119cec4f073b6d214195ecbe1fad3abdf2b34
+Subproject commit 15499eb94227401bdc8875da6eb85c15d37068f7