# OpenAI Compatible Server

`llama-cpp-python` offers an OpenAI API compatible web server.

This web server can be used to serve local models and easily connect them to existing clients.

## Setup

### Installation

The server can be installed by running the following command:

```bash
pip install llama-cpp-python[server]
```

### Running the server

The server can then be started by running the following command:

```bash
python3 -m llama_cpp.server --model <model_path>
```
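
By default the server listens on `http://localhost:8000` and exposes the OpenAI-compatible endpoints under `/v1`. As a quick check (a sketch assuming the default host and port), you can send a chat completion request with `curl`:

```bash
# Quick smoke test against the local server (default host/port assumed).
# The "model" value is routed to whatever model the server has loaded.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```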
### Server options

For a full list of options, run:

```bash
python3 -m llama_cpp.server --help
```

NOTE: All server options are also available as environment variables. For example, `--model` can be set via the `MODEL` environment variable.
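
For instance, these two invocations are equivalent:

```bash
# Passing the model path as a CLI flag...
python3 -m llama_cpp.server --model <model_path>

# ...or as the corresponding environment variable.
MODEL=<model_path> python3 -m llama_cpp.server
```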
## Guides

### Function Calling

`llama-cpp-python` supports structured function calling based on a JSON schema.

You'll first need to download one of the available function calling models in GGUF format:

- [functionary-7b-v1](https://huggingface.co/abetlen/functionary-7b-v1-GGUF)

When you run the server, you'll also need to specify the `functionary` chat format:

```bash
python3 -m llama_cpp.server --model <model_path> --chat_format functionary
```
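
Once the server is running, function calls go through the regular OpenAI client interface. The sketch below is illustrative rather than canonical: the `get_current_weather` tool and its schema are invented for the example, and depending on the server version the request may need the legacy `functions`/`function_call` parameters instead of `tools`/`tool_choice`.

```python
from openai import OpenAI

# Point the client at the local server; host and port are placeholders.
client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # routed to the locally loaded model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    # A hypothetical tool schema the model can choose to call.
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name, e.g. Paris",
                        }
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    tool_choice="auto",
)

# If the model decided to call the tool, the name and JSON-encoded
# arguments are returned on the message.
print(response.choices[0].message.tool_calls)
```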
### Multimodal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models, which allow the language model to read information from both text and images.

You'll first need to download one of the available multi-modal models in GGUF format:

- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)

When you run the server, you'll also need to specify the path to the CLIP model used for image embedding, as well as the `llava-1-5` chat format:

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```

Then you can use the OpenAI API as normal:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "<image_url>"},
                },
                {"type": "text", "text": "What does the image say?"},
            ],
        }
    ],
)
print(response)
```
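
To send a local image rather than a remote URL, one common approach is to inline it as a base64 `data:` URI in the `image_url` field. This is a sketch, assuming the server accepts data URIs (support may vary by version):

```python
import base64

def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    """Encode a local image file as a base64 data URI."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Pass the result wherever <image_url> appears in the example above.
print(image_to_data_uri("image.png")[:60])
```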