# OpenAI Compatible Server

`llama-cpp-python` offers an OpenAI API compatible web server.

This web server can be used to serve local models and easily connect them to existing clients.

## Setup

### Installation

The server can be installed by running the following command:

```bash
pip install llama-cpp-python[server]
```
### Running the server

The server can then be started by running the following command:

```bash
python3 -m llama_cpp.server --model <model_path>
```
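Once the server is up you can point any OpenAI-compatible client at it. Below is a minimal sketch using the official `openai` Python package; it assumes the server's default host and port (`http://localhost:8000`), and the API key and `model` name shown are illustrative placeholders rather than values the server requires.

```python
from openai import OpenAI

# Point the client at the local server instead of api.openai.com.
# By default the server does not check the API key, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="local-model",  # illustrative placeholder; not a value required by the server
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```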
### Server options

For a full list of options, run:

```bash
python3 -m llama_cpp.server --help
```

NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.
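As a quick illustration of the environment-variable form, the sketch below launches the server from Python with its options supplied through the environment instead of CLI flags. The `MODEL` variable comes from the note above; treating `HOST` and `PORT` the same way is an assumption based on the rule that every option has an environment-variable equivalent.

```python
import os
import subprocess

# Pass server options via environment variables instead of CLI flags.
env = dict(
    os.environ,
    MODEL="<model_path>",  # equivalent to --model
    HOST="0.0.0.0",        # assumed equivalent to --host
    PORT="8000",           # assumed equivalent to --port
)
subprocess.run(["python3", "-m", "llama_cpp.server"], env=env, check=True)
```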
Check out the server config reference below for more information on the available options. CLI arguments and environment variables are available for all of the fields defined in [`ServerSettings`](#llama_cpp.server.settings.ServerSettings) and [`ModelSettings`](#llama_cpp.server.settings.ModelSettings).

Additionally, the server supports configuration via a config file; check out the [configuration section](#configuration-and-multi-model-support) for more information and examples.
## Guides

### Code Completion

`llama-cpp-python` supports code completion via GitHub Copilot.

*NOTE*: Without GPU acceleration this is unlikely to be fast enough to be usable.

You'll first need to download one of the available code completion models in GGUF format:

- [replit-code-v1_5-GGUF](https://huggingface.co/abetlen/replit-code-v1_5-3b-GGUF)

Then you'll need to run the OpenAI compatible web server with a substantially increased context size to accommodate GitHub Copilot requests:

```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```

Then just update your settings in `.vscode/settings.json` to point to your code completion server:

```json
{
  // ...
  "github.copilot.advanced": {
    "debug.testOverrideProxyUrl": "http://<host>:<port>",
    "debug.overrideProxyUrl": "http://<host>:<port>"
  }
  // ...
}
```
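Before pointing Copilot at the server, it can be useful to confirm that plain completion requests work. The sketch below uses the official `openai` client against the server's completions endpoint; the host, port, and `model` name are illustrative placeholders, not values the server requires.

```python
from openai import OpenAI

# Placeholder host/port; use whatever you started the server with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

# Ask the code model to continue a function body via the completions endpoint.
completion = client.completions.create(
    model="copilot-codex",  # illustrative name; see the multi-model config below for aliasing
    prompt="def fibonacci(n):\n    ",
    max_tokens=64,
    temperature=0.1,
)
print(completion.choices[0].text)
```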
### Function Calling

`llama-cpp-python` supports structured function calling based on a JSON schema.
Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.

You'll first need to download one of the available function calling models in GGUF format:

- [functionary](https://huggingface.co/meetkai)

Then when you run the server you'll also need to specify either `functionary-v1` or `functionary-v2` as the `chat_format`.

Note that since functionary requires an HF tokenizer (due to discrepancies between llama.cpp and HuggingFace's tokenizers, as mentioned [here](https://github.com/abetlen/llama-cpp-python/blob/main?tab=readme-ov-file#function-calling)), you will also need to pass in the path to the tokenizer. The tokenizer files are already included in the respective HF repositories hosting the gguf files.

```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```
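Because the server mirrors the OpenAI function calling API, requests can use the standard `tools` parameter of the official Python client. The example below is a hedged sketch: the weather tool, host, port, and `model` name are illustrative assumptions, not part of the server's API.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

# Describe a tool with a JSON schema; the model decides whether to call it.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # illustrative; matches whatever model or alias the server is serving
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```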
Check out this [example notebook](https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb) for a walkthrough of some interesting use cases for function calling.
### Multimodal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models, which allow the language model to read information from both text and images.

You'll first need to download one of the available multi-modal models in GGUF format:

- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)

Then when you run the server you'll also need to specify the path to the clip model used for image embedding and the `llava-1-5` chat_format:

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```
Then you can just use the OpenAI API as normal:

```python3
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image_url>"
                    },
                },
                {"type": "text", "text": "What does the image say"},
            ],
        }
    ],
)
print(response)
```
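If your image is a local file rather than a URL, one option (assuming the server accepts base64 data URIs in `image_url`, as the OpenAI vision API does) is to inline the image. The file name, host, and port below are placeholders.

```python
import base64
from openai import OpenAI

# Read a local file and wrap it as a data URI (assumption: the server
# accepts base64 data URIs in "image_url", like the OpenAI vision API).
with open("example.jpg", "rb") as f:  # placeholder file name
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```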
## Configuration and Multi-Model Support

The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.

```bash
python3 -m llama_cpp.server --config_file <config_file>
```

Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.

The server supports routing requests to multiple models based on the `model` parameter in the request, which matches against the `model_alias` in the config file, as shown in the example config and request below.

At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
      "model_alias": "gpt-3.5-turbo",
      "chat_format": "chatml",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 512,
      "n_ctx": 2048
    },
    {
      "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
      "model_alias": "gpt-4",
      "chat_format": "chatml",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 512,
      "n_ctx": 2048
    },
    {
      "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
      "model_alias": "gpt-4-vision-preview",
      "chat_format": "llava-1-5",
      "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 512,
      "n_ctx": 2048
    },
    {
      "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
      "model_alias": "text-davinci-003",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 512,
      "n_ctx": 2048
    },
    {
      "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
      "model_alias": "copilot-codex",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 1024,
      "n_ctx": 9216
    }
  ]
}
```
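With a config file like the one above, clients choose a model by sending the matching `model_alias` as the `model` parameter. The sketch below assumes the host and port from the example config and uses two of its aliases; adjust both to your own setup.

```python
from openai import OpenAI

# The config above binds to 0.0.0.0:8080; connect via localhost (or the machine's address).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Routed to the OpenHermes model via its "gpt-3.5-turbo" alias.
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(chat.choices[0].message.content)

# Routed to the replit code model via its "copilot-codex" alias.
completion = client.completions.create(
    model="copilot-codex",
    prompt="def add(a, b):\n    ",
    max_tokens=32,
)
print(completion.choices[0].text)
```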
The config file format is defined by the [`ConfigFileSettings`](#llama_cpp.server.settings.ConfigFileSettings) class.

## Server Options Reference

::: llama_cpp.server.settings.ConfigFileSettings
    options:
        show_if_no_docstring: true

::: llama_cpp.server.settings.ServerSettings
    options:
        show_if_no_docstring: true

::: llama_cpp.server.settings.ModelSettings
    options:
        show_if_no_docstring: true