commit cd66f3cfb4
13 changed files with 322 additions and 54 deletions
CHANGELOG.md (19 lines changed)

@@ -7,6 +7,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
+## [0.2.37]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@fea4fd4ba7f6b754ac795387b275e1a014a77bde
+- feat: Automatically set chat format from gguf by @abetlen in #1110
+
+## [0.2.36]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f
+- feat: Add mistral instruct chat format as "mistral-instruct" by @Rafaelblsilva in #799
+
+## [0.2.35]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00
+
+## [0.2.34]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855
+- feat: Add json schema mode by @abetlen in #1122
+
 ## [0.2.33]
 
 - feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4
Makefile (9 lines changed)

@@ -27,6 +27,15 @@ build.blis:
 build.metal:
 	CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install --verbose -e .
 
+build.vulkan:
+	CMAKE_ARGS="-DLLAMA_VULKAN=on" python3 -m pip install --verbose -e .
+
+build.kompute:
+	CMAKE_ARGS="-DLLAMA_KOMPUTE=on" python3 -m pip install --verbose -e .
+
+build.sycl:
+	CMAKE_ARGS="-DLLAMA_SYCL=on" python3 -m pip install --verbose -e .
+
 build.sdist:
 	python3 -m build --sdist
README.md (104 lines changed)

@@ -12,20 +12,17 @@ This package provides:
 - Low-level access to C API via `ctypes` interface.
 - High-level Python API for text completion
 - OpenAI-like API
 - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)
 - [LlamaIndex compatibility](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html)
 - OpenAI compatible web server
 - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
 - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
 - [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)
 - [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)
 
 Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).
 
-
-
-
 ## Installation
 
 `llama-cpp-python` can be installed directly from PyPI as a source distribution by running:

@@ -38,7 +35,6 @@ This will build `llama.cpp` from source using cmake and your system's c compiler
 If you run into issues during installation add the `--verbose` flag to the `pip install` command to see the full cmake build log.
 
-
 ### Installation with Specific Hardware Acceleration (BLAS, CUDA, Metal, etc)
 
 The default pip install behaviour is to build `llama.cpp` for CPU only on Linux and Windows and use Metal on MacOS.

@@ -71,7 +67,7 @@ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-
 #### cuBLAS
 
-To install with cuBLAS, set the `LLAMA_CUBLAS=1` environment variable before installing:
+To install with cuBLAS, set the `LLAMA_CUBLAS=on` environment variable before installing:
 
 ```bash
 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

@@ -87,7 +83,7 @@ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
 #### CLBlast
 
-To install with CLBlast, set the `LLAMA_CLBLAST=1` environment variable before installing:
+To install with CLBlast, set the `LLAMA_CLBLAST=on` environment variable before installing:
 
 ```bash
 CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python

@@ -101,6 +97,30 @@ To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on`
 CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
 ```
 
+#### Vulkan
+
+To install with Vulkan support, set the `LLAMA_VULKAN=on` environment variable before installing:
+
+```bash
+CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
+```
+
+#### Kompute
+
+To install with Kompute support, set the `LLAMA_KOMPUTE=on` environment variable before installing:
+
+```bash
+CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
+```
+
+#### SYCL
+
+To install with SYCL support, set the `LLAMA_SYCL=on` environment variable before installing:
+
+```bash
+CMAKE_ARGS="-DLLAMA_SYCL=on" pip install llama-cpp-python
+```
+
 ### Windows Notes
 
 If you run into issues where it complains it can't find `'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:

@@ -216,6 +236,59 @@ Note that `chat_format` option must be set for the particular model you are usin
 Chat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.
 
+### JSON and JSON Schema Mode
+
+If you want to constrain chat responses to only valid JSON or a specific JSON Schema you can use the `response_format` argument to the `create_chat_completion` method.
+
+#### JSON Mode
+
+The following example will constrain the response to be valid JSON.
+
+```python
+>>> from llama_cpp import Llama
+>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
+>>> llm.create_chat_completion(
+      messages=[
+          {
+              "role": "system",
+              "content": "You are a helpful assistant that outputs in JSON.",
+          },
+          {"role": "user", "content": "Who won the world series in 2020"},
+      ],
+      response_format={
+          "type": "json_object",
+      },
+      temperature=0.7,
+)
+```
+
+#### JSON Schema Mode
+
+To constrain the response to a specific JSON Schema, you can use the `schema` property of the `response_format` argument.
+
+```python
+>>> from llama_cpp import Llama
+>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
+>>> llm.create_chat_completion(
+      messages=[
+          {
+              "role": "system",
+              "content": "You are a helpful assistant that outputs in JSON.",
+          },
+          {"role": "user", "content": "Who won the world series in 2020"},
+      ],
+      response_format={
+          "type": "json_object",
+          "schema": {
+              "type": "object",
+              "properties": {"team_name": {"type": "string"}},
+              "required": ["team_name"],
+          },
+      },
+      temperature=0.7,
+)
+```
+
 ### Function Calling
 
 The high-level API also provides a simple interface for function calling.

@@ -223,7 +296,6 @@ The high-level API also provides a simple interface for function calling.
 Note that the only model that supports full function calling at this time is "functionary".
 The gguf-converted files for this model can be found here: [functionary-7b-v1](https://huggingface.co/abetlen/functionary-7b-v1-GGUF)
 
-
 ```python
 >>> from llama_cpp import Llama
 >>> llm = Llama(model_path="path/to/functionary/llama-model.gguf", chat_format="functionary")

@@ -271,7 +343,6 @@ The gguf-converted files for this model can be found here: [functionary-7b-v1](h
 
 ### Multi-modal Models
 
-
 `llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
 read information from both text and images.
 

@@ -317,7 +388,6 @@ For instance, if you want to work with larger contexts, you can expand the conte
 llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
 ```
 
-
 ## OpenAI Compatible Web Server
 
 `llama-cpp-python` offers a web server which aims to act as a drop-in replacement for the OpenAI API.

@@ -365,6 +435,7 @@ A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python).
 ```bash
 docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
 ```
+
 [Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)
 
 ## Low-level API

@@ -393,7 +464,6 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 
 Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.
 
-
 ## Documentation
 
 Documentation is available via [https://llama-cpp-python.readthedocs.io/](https://llama-cpp-python.readthedocs.io/).
@@ -9,7 +9,7 @@ export MODEL=../models/7B/...
 
 Then run:
 ```
-uvicorn llama_cpp.server.app:app --reload
+uvicorn --factory llama_cpp.server.app:create_app --reload
 ```
 
 or
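The switch to `--factory` reflects that the server app is now built by a factory function rather than exposed as a module-level `app` object. A minimal programmatic sketch of the same startup, assuming `create_app` is the factory named in the command above and that the `MODEL` environment variable points at a gguf file as described:

```python
import uvicorn

from llama_cpp.server.app import create_app  # factory referenced by --factory above

if __name__ == "__main__":
    # create_app() builds the FastAPI app from environment/CLI settings,
    # so MODEL must be set before this runs.
    uvicorn.run(create_app(), host="127.0.0.1", port=8000)
```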
@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *
 
-__version__ = "0.2.33"
+__version__ = "0.2.37"
@@ -216,13 +216,13 @@ class _LlamaModel:
         for i in range(llama_cpp.llama_model_meta_count(self.model)):
             nbytes = llama_cpp.llama_model_meta_key_by_index(self.model, i, buffer, buffer_size)
             if nbytes > buffer_size:
-                buffer_size = nbytes
+                buffer_size = nbytes + 1
                 buffer = ctypes.create_string_buffer(buffer_size)
                 nbytes = llama_cpp.llama_model_meta_key_by_index(self.model, i, buffer, buffer_size)
             key = buffer.value.decode("utf-8")
             nbytes = llama_cpp.llama_model_meta_val_str_by_index(self.model, i, buffer, buffer_size)
             if nbytes > buffer_size:
-                buffer_size = nbytes
+                buffer_size = nbytes + 1
                 buffer = ctypes.create_string_buffer(buffer_size)
                 nbytes = llama_cpp.llama_model_meta_val_str_by_index(self.model, i, buffer, buffer_size)
             value = buffer.value.decode("utf-8")
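The `nbytes + 1` change matters because these getters behave like `snprintf`-style APIs that report the string length without the terminating NUL, so a buffer sized exactly `nbytes` would lose the last character. A standalone sketch of the same grow-and-retry pattern; `writer` is a hypothetical callable standing in for the `llama_model_meta_*_by_index` bindings:

```python
import ctypes
from typing import Callable

def read_c_string(writer: Callable[[ctypes.Array, int], int], initial_size: int = 1024) -> str:
    """Call an snprintf-style writer, growing the buffer once if it was too small."""
    buffer_size = initial_size
    buffer = ctypes.create_string_buffer(buffer_size)
    nbytes = writer(buffer, buffer_size)
    if nbytes > buffer_size:
        buffer_size = nbytes + 1  # reported length excludes the trailing NUL
        buffer = ctypes.create_string_buffer(buffer_size)
        nbytes = writer(buffer, buffer_size)
    return buffer.value.decode("utf-8")
```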
@@ -87,7 +87,7 @@ class Llama:
         # Backend Params
         numa: bool = False,
         # Chat Format Params
-        chat_format: str = "llama-2",
+        chat_format: Optional[str] = None,
         chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None,
         # Misc
         verbose: bool = True,

@@ -343,6 +343,41 @@ class Llama:
         if self.verbose:
             print(f"Model metadata: {self.metadata}", file=sys.stderr)
 
+        if self.chat_format is None and self.chat_handler is None and "tokenizer.chat_template" in self.metadata:
+            chat_format = llama_chat_format.guess_chat_format_from_gguf_metadata(self.metadata)
+
+            if chat_format is not None:
+                self.chat_format = chat_format
+                if self.verbose:
+                    print(f"Guessed chat format: {chat_format}", file=sys.stderr)
+            else:
+                template = self.metadata["tokenizer.chat_template"]
+                try:
+                    eos_token_id = int(self.metadata["tokenizer.ggml.eos_token_id"])
+                except:
+                    eos_token_id = self.token_eos()
+                try:
+                    bos_token_id = int(self.metadata["tokenizer.ggml.bos_token_id"])
+                except:
+                    bos_token_id = self.token_bos()
+
+                eos_token = self.detokenize([eos_token_id]).decode("utf-8")
+                bos_token = self.detokenize([bos_token_id]).decode("utf-8")
+
+                if self.verbose:
+                    print(f"Using chat template: {template}", file=sys.stderr)
+                    print(f"Using chat eos_token: {eos_token}", file=sys.stderr)
+                    print(f"Using chat bos_token: {bos_token}", file=sys.stderr)
+
+                self.chat_handler = llama_chat_format.Jinja2ChatFormatter(
+                    template=template,
+                    eos_token=eos_token,
+                    bos_token=bos_token
+                ).to_chat_handler()
+
+        if self.chat_format is None and self.chat_handler is None:
+            self.chat_format = "llama-2"
+
     @property
     def ctx(self) -> llama_cpp.llama_context_p:
         assert self._ctx.ctx is not None
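Taken together, the constructor change and the block above mean `chat_format` no longer silently defaults to `"llama-2"`. A minimal usage sketch (the model path is a placeholder):

```python
from llama_cpp import Llama

# With chat_format left as None, the format is resolved in this order:
#   1. a known format guessed from the gguf's tokenizer.chat_template,
#   2. a Jinja2ChatFormatter-based handler built from the raw template,
#   3. the "llama-2" fallback only if neither is available.
llm = Llama(model_path="path/to/model.gguf")
print(llm.chat_format, llm.chat_handler)
```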
@@ -14,6 +14,20 @@ import llama_cpp.llama_grammar as llama_grammar
 
 from ._utils import suppress_stdout_stderr, Singleton
 
+### Common Chat Templates and Special Tokens ###
+
+# Source: https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B/blob/main/tokenizer_config.json
+CHATML_CHAT_TEMPLATE = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
+CHATML_BOS_TOKEN = "<s>"
+CHATML_EOS_TOKEN = "<|im_end|>"
+
+# Source: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json
+MISTRAL_INSTRUCT_CHAT_TEMPLATE = "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
+MISTRAL_INSTRUCT_BOS_TOKEN = "<s>"
+MISTRAL_INSTRUCT_EOS_TOKEN = "</s>"
+
+
+### Chat Completion Handler ###
 
 class LlamaChatCompletionHandler(Protocol):
     """Base Protocol for a llama chat completion handler.

@@ -118,7 +132,6 @@ def register_chat_completion_handler(name: str):
 
 ### Chat Formatter ###
 
-
 @dataclasses.dataclass
 class ChatFormatterResponse:
     """Dataclass that stores completion parameters for a given chat format and

@@ -172,16 +185,17 @@ class Jinja2ChatFormatter(ChatFormatter):
         messages: List[llama_types.ChatCompletionRequestMessage],
         **kwargs: Any,
     ) -> ChatFormatterResponse:
-        if self.add_generation_prompt:
-            messages = [
-                *messages,
-                llama_types.ChatCompletionRequestAssistantMessage(
-                    role="assistant", content=""
-                ),
-            ]
+        def raise_exception(message: str):
+            raise ValueError(message)
+
         prompt = self._environment.render(
-            messages=messages, eos_token=self.eos_token, bos_token=self.bos_token
+            messages=messages,
+            eos_token=self.eos_token,
+            bos_token=self.bos_token,
+            raise_exception=raise_exception,
+            add_generation_prompt=self.add_generation_prompt
         )
 
         return ChatFormatterResponse(prompt=prompt, stop=[self.eos_token])
 
     def to_chat_handler(self) -> LlamaChatCompletionHandler:

@@ -318,7 +332,14 @@ def chat_formatter_to_chat_completion_handler(
             stop = stop + rstop
 
         if response_format is not None and response_format["type"] == "json_object":
-            grammar = llama_grammar.LlamaGrammar.from_string(llama_grammar.JSON_GBNF)
+            try:
+                # create grammar from json schema
+                if "schema" in response_format:
+                    grammar = llama_grammar.LlamaGrammar.from_json_schema(
+                        json.dumps(response_format["schema"])
+                    )
+            except Exception as e:
+                grammar = llama_grammar.LlamaGrammar.from_string(llama_grammar.JSON_GBNF)
 
         completion_or_chunks = llama.create_completion(
             prompt=prompt,

@@ -433,7 +454,20 @@ def hf_tokenizer_config_to_chat_completion_handler(
     return chat_formatter_to_chat_completion_handler(chat_formatter)
 
 
+def guess_chat_format_from_gguf_metadata(metadata: Dict[str, str]) -> Optional[str]:
+    if "tokenizer.chat_template" not in metadata:
+        return None
+
+    if metadata["tokenizer.chat_template"] == CHATML_CHAT_TEMPLATE:
+        return "chatml"
+
+    if metadata["tokenizer.chat_template"] == MISTRAL_INSTRUCT_CHAT_TEMPLATE:
+        return "mistral-instruct"
+
+    return None
+
+
 ### Utility functions for formatting chat prompts ###
+# TODO: Replace these with jinja2 templates
 
 
 def _get_system_message(

@@ -870,6 +904,24 @@ def format_chatml(
     return ChatFormatterResponse(prompt=_prompt, stop=_sep)
 
 
+@register_chat_format("mistral-instruct")
+def format_mistral_instruct(
+    messages: List[llama_types.ChatCompletionRequestMessage],
+    **kwargs: Any,
+) -> ChatFormatterResponse:
+    bos = "<s>"
+    eos = "</s>"
+    stop = eos
+    prompt = bos
+    for message in messages:
+        if message["role"] == "user" and message["content"] is not None and isinstance(message["content"], str):
+            prompt += "[INST] " + message["content"]
+        elif message["role"] == "assistant" and message["content"] is not None and isinstance(message["content"], str):
+            prompt += " [/INST]" + message["content"] + eos
+    prompt += " [/INST]"
+    return ChatFormatterResponse(prompt=prompt, stop=stop)
+
+
 @register_chat_format("chatglm3")
 def format_chatglm3(
     messages: List[llama_types.ChatCompletionRequestMessage],

@@ -904,7 +956,6 @@ def format_openchat(
     _prompt = _format_chatml(system_message, _messages, _sep)
     return ChatFormatterResponse(prompt=_prompt, stop=_sep)
 
-
 # Chat format for Saiga models, see more details and available models:
 # https://huggingface.co/collections/IlyaGusev/saiga2-saigamistral-6505d4ccc3d1e53166b636cd
 @register_chat_format("saiga")

@@ -926,6 +977,7 @@ def format_saiga(
     _prompt += "<s>bot"
     return ChatFormatterResponse(prompt=_prompt.strip())
 
+# Tricky chat formats that require custom chat handlers
 
 @register_chat_completion_handler("functionary")
 def functionary_chat_handler(

@@ -1434,10 +1486,14 @@ class Llava15ChatHandler:
             prompt = llama.input_ids[: llama.n_tokens].tolist()
 
             if response_format is not None and response_format["type"] == "json_object":
-                with suppress_stdout_stderr(disable=self.verbose):
-                    grammar = llama_grammar.LlamaGrammar.from_string(
-                        llama_grammar.JSON_GBNF
-                    )
+                try:
+                    # create grammar from json schema
+                    if "schema" in response_format:
+                        grammar = llama_grammar.LlamaGrammar.from_json_schema(
+                            json.dumps(response_format["schema"])
+                        )
+                except Exception as e:
+                    grammar = llama_grammar.LlamaGrammar.from_string(llama_grammar.JSON_GBNF)
 
             return _convert_completion_to_chat(
                 llama.create_completion(
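A small sketch of the two helpers introduced above, using only names defined in this file; the asserted prompt string follows directly from the `format_mistral_instruct` body for a single user turn:

```python
import llama_cpp.llama_chat_format as llama_chat_format

# Template-to-format guessing: returns "chatml", "mistral-instruct", or None.
fmt = llama_chat_format.guess_chat_format_from_gguf_metadata(
    {"tokenizer.chat_template": llama_chat_format.MISTRAL_INSTRUCT_CHAT_TEMPLATE}
)
assert fmt == "mistral-instruct"

# The registered formatter renders bare [INST] ... [/INST] turns.
response = llama_chat_format.format_mistral_instruct(
    messages=[{"role": "user", "content": "Hello"}],
)
assert response.prompt == "<s>[INST] Hello [/INST]"
assert response.stop == "</s>"
```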
@@ -93,14 +93,12 @@ c_size_t_p = POINTER(c_size_t)
 
 # from ggml-backend.h
 # typedef bool (*ggml_backend_sched_eval_callback)(struct ggml_tensor * t, bool ask, void * user_data);
-ggml_backend_sched_eval_callback = ctypes.CFUNCTYPE(
-    c_bool, c_void_p, c_bool, c_void_p
-)
+ggml_backend_sched_eval_callback = ctypes.CFUNCTYPE(c_bool, c_void_p, c_bool, c_void_p)
 
 # llama.h bindings
 
 _lib.llama_max_devices.argtypes = []
-_lib.llama_max_devices.restype = ctypes.c_int32
+_lib.llama_max_devices.restype = ctypes.c_size_t
 
 LLAMA_MAX_DEVICES = _lib.llama_max_devices()

@@ -189,6 +187,7 @@ LLAMA_TOKEN_TYPE_BYTE = 6
 # LLAMA_FTYPE_MOSTLY_IQ2_XS = 20, // except 1d tensors
 # LLAMA_FTYPE_MOSTLY_Q2_K_S = 21, // except 1d tensors
 # LLAMA_FTYPE_MOSTLY_Q3_K_XS = 22, // except 1d tensors
+# LLAMA_FTYPE_MOSTLY_IQ3_XXS = 23, // except 1d tensors
 
 # LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
 # };

@@ -213,6 +212,7 @@ LLAMA_FTYPE_MOSTLY_IQ2_XXS = 19
 LLAMA_FTYPE_MOSTLY_IQ2_XS = 20
 LLAMA_FTYPE_MOSTLY_Q2_K_S = 21
 LLAMA_FTYPE_MOSTLY_Q3_K_XS = 22
+LLAMA_FTYPE_MOSTLY_IQ3_XXS = 23
 LLAMA_FTYPE_GUESSED = 1024
 
 # enum llama_rope_scaling_type {

@@ -390,7 +390,7 @@ class llama_model_kv_override(Structure):
 # // LLAMA_SPLIT_LAYER: ignored
 # int32_t main_gpu;
 
-# // proportion of the model (layers or rows) to offload to each GPU, size: LLAMA_MAX_DEVICES
+# // proportion of the model (layers or rows) to offload to each GPU, size: llama_max_devices()
 # const float * tensor_split;
 
 # // Called with a progress value between 0.0 and 1.0. Pass NULL to disable.

@@ -417,7 +417,7 @@ class llama_model_params(Structure):
         n_gpu_layers (int): number of layers to store in VRAM
         split_mode (int): how to split the model across multiple GPUs
         main_gpu (int): the GPU that is used for the entire model. main_gpu interpretation depends on split_mode: LLAMA_SPLIT_NONE: the GPU that is used for the entire model LLAMA_SPLIT_ROW: the GPU that is used for small tensors and intermediate results LLAMA_SPLIT_LAYER: ignored
-        tensor_split (ctypes.Array[ctypes.c_float]): proportion of the model (layers or rows) to offload to each GPU, size: LLAMA_MAX_DEVICES
+        tensor_split (ctypes.Array[ctypes.c_float]): proportion of the model (layers or rows) to offload to each GPU, size: llama_max_devices()
         progress_callback (llama_progress_callback): called with a progress value between 0.0 and 1.0. Pass NULL to disable. If the provided progress_callback returns true, model loading continues. If it returns false, model loading is immediately aborted.
         progress_callback_user_data (ctypes.c_void_p): context pointer passed to the progress callback
         kv_overrides (ctypes.Array[llama_model_kv_override]): override key-value pairs of the model meta data

@@ -760,16 +760,43 @@ _lib.llama_time_us.argtypes = []
 _lib.llama_time_us.restype = ctypes.c_int64
 
 
-# LLAMA_API int32_t llama_max_devices(void);
+# LLAMA_API size_t llama_max_devices(void);
 def llama_max_devices() -> int:
     return _lib.llama_max_devices()
 
 
 _lib.llama_max_devices.argtypes = []
-_lib.llama_max_devices.restype = ctypes.c_int32
+_lib.llama_max_devices.restype = ctypes.c_size_t
 
 
-# LLAMA_API bool llama_mmap_supported (void);
+# LLAMA_API bool llama_supports_mmap (void);
+def llama_supports_mmap() -> bool:
+    return _lib.llama_supports_mmap()
+
+
+_lib.llama_supports_mmap.argtypes = []
+_lib.llama_supports_mmap.restype = c_bool
+
+
+# LLAMA_API bool llama_supports_mlock (void);
+def llama_supports_mlock() -> bool:
+    return _lib.llama_supports_mlock()
+
+
+_lib.llama_supports_mlock.argtypes = []
+_lib.llama_supports_mlock.restype = c_bool
+
+
+# LLAMA_API bool llama_supports_gpu_offload(void);
+def llama_supports_gpu_offload() -> bool:
+    return _lib.llama_supports_gpu_offload()
+
+
+_lib.llama_supports_gpu_offload.argtypes = []
+_lib.llama_supports_gpu_offload.restype = c_bool
+
+
+# LLAMA_API DEPRECATED(bool llama_mmap_supported (void), "use llama_supports_mmap() instead");
 def llama_mmap_supported() -> bool:
     return _lib.llama_mmap_supported()

@@ -778,7 +805,7 @@ _lib.llama_mmap_supported.argtypes = []
 _lib.llama_mmap_supported.restype = c_bool
 
 
-# LLAMA_API bool llama_mlock_supported(void);
+# LLAMA_API DEPRECATED(bool llama_mlock_supported(void), "use llama_supports_mlock() instead");
 def llama_mlock_supported() -> bool:
     return _lib.llama_mlock_supported()
 

@@ -2174,6 +2201,34 @@ _lib.llama_sample_typical.argtypes = [
 _lib.llama_sample_typical.restype = None
 
 
+# /// @details Dynamic temperature implementation described in the paper https://arxiv.org/abs/2309.02772.
+# LLAMA_API void llama_sample_entropy(
+#     struct llama_context * ctx,
+#     llama_token_data_array * candidates_p,
+#     float min_temp,
+#     float max_temp,
+#     float exponent_val);
+def llama_sample_entropy(
+    ctx: llama_context_p,
+    candidates,  # type: _Pointer[llama_token_data_array]
+    min_temp: Union[c_float, float],
+    max_temp: Union[c_float, float],
+    exponent_val: Union[c_float, float],
+):
+    """Dynamic temperature implementation described in the paper https://arxiv.org/abs/2309.02772."""
+    return _lib.llama_sample_entropy(ctx, candidates, min_temp, max_temp, exponent_val)
+
+
+_lib.llama_sample_entropy.argtypes = [
+    llama_context_p,
+    llama_token_data_array_p,
+    c_float,
+    c_float,
+    c_float,
+]
+_lib.llama_sample_entropy.restype = None
+
+
 # LLAMA_API void llama_sample_temp(
 #     struct llama_context * ctx,
 #     llama_token_data_array * candidates,
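A quick sketch of the new capability probes. The returned values depend on which backends the bundled shared library was built with, and the calls are assumed to be reachable from the top-level package via the `from .llama_cpp import *` re-export shown earlier:

```python
import llama_cpp

print("max devices:", llama_cpp.llama_max_devices())
print("mmap:       ", llama_cpp.llama_supports_mmap())
print("mlock:      ", llama_cpp.llama_supports_mlock())
print("gpu offload:", llama_cpp.llama_supports_gpu_offload())
```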
@@ -154,6 +154,7 @@ class ChatCompletionFunctionCallOption(TypedDict):
 
 class ChatCompletionRequestResponseFormat(TypedDict):
     type: Literal["text", "json_object"]
+    schema: NotRequired[JsonType]  # https://docs.endpoints.anyscale.com/guides/json_mode/
 
 
 class ChatCompletionRequestMessageContentPartText(TypedDict):
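A sketch of the payload this one-line change enables; because `schema` is `NotRequired`, existing `{"type": "json_object"}` payloads keep working unchanged:

```python
from llama_cpp.llama_types import ChatCompletionRequestResponseFormat

response_format: ChatCompletionRequestResponseFormat = {
    "type": "json_object",
    # optional JSON Schema; if omitted, the handlers fall back to the generic JSON grammar
    "schema": {"type": "object", "properties": {"answer": {"type": "string"}}},
}
```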
@@ -113,8 +113,8 @@ class ModelSettings(BaseSettings):
         description="Enable NUMA support.",
     )
     # Chat Format Params
-    chat_format: str = Field(
-        default="llama-2",
+    chat_format: Optional[str] = Field(
+        default=None,
         description="Chat format to use.",
     )
     clip_model_path: Optional[str] = Field(
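A sketch of the effect on the server side, assuming the usual `llama_cpp.server.settings` module path for `ModelSettings`: leaving `chat_format` unset now defers to the gguf-based guessing in `Llama.__init__` instead of forcing `llama-2`:

```python
from llama_cpp.server.settings import ModelSettings

settings = ModelSettings(model="path/to/model.gguf")  # placeholder path
assert settings.chat_format is None  # previously defaulted to "llama-2"
```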
@@ -1,10 +1,33 @@
 import json
 
+import jinja2
+
 from llama_cpp import (
     ChatCompletionRequestUserMessage,
 )
+import llama_cpp.llama_types as llama_types
+import llama_cpp.llama_chat_format as llama_chat_format
+
 from llama_cpp.llama_chat_format import hf_tokenizer_config_to_chat_formatter
 
+
+def test_mistral_instruct():
+    chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
+    chat_formatter = jinja2.Template(chat_template)
+    messages = [
+        llama_types.ChatCompletionRequestUserMessage(role="user", content="Instruction"),
+        llama_types.ChatCompletionRequestAssistantMessage(role="assistant", content="Model answer"),
+        llama_types.ChatCompletionRequestUserMessage(role="user", content="Follow-up instruction"),
+    ]
+    response = llama_chat_format.format_mistral_instruct(
+        messages=messages,
+    )
+    reference = chat_formatter.render(
+        messages=messages,
+        bos_token="<s>",
+        eos_token="</s>",
+    )
+    assert response.prompt == reference
+
+
 mistral_7b_tokenizer_config = """{
     "add_bos_token": true,
vendor/llama.cpp (vendored, 2 lines changed)

@@ -1 +1 @@
-Subproject commit faa3526a1eba458120987ed8269e5616385a76f4
+Subproject commit 5cb04dbc16d1da38c8fdcc0111b40e67d00dd1c3