docs: Update high-level python api examples in README to include chat formats, function calling, and multi-modal models.

parent d977b44d82
commit bd43fb2bfe
1 changed file with 112 additions and 2 deletions

README.md

@@ -110,12 +110,17 @@ Detailed MacOS Metal GPU install documentation is available at [docs/install/mac

The high-level API provides a simple managed interface through the `Llama` class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
@@ -138,6 +143,111 @@ Below is a short example demonstrating how to use the high-level API to generate
}
```

### Chat Completion

The high-level API also provides a simple interface for chat completion.

Note that the `chat_format` option must be set for the particular model you are using.

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="path/to/llama-2/llama-model.gguf", chat_format="llama-2")
>>> llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)
```
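
The call returns an OpenAI-compatible chat completion dictionary. As a minimal sketch (non-streaming, with a hypothetical prompt), the reply text can be read from the first choice:

```python
>>> response = llm.create_chat_completion(
...     messages=[{"role": "user", "content": "Name the planets in the solar system."}]  # hypothetical prompt
... )
>>> print(response["choices"][0]["message"]["content"])  # generated reply text
```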

### Function Calling

The high-level API also provides a simple interface for function calling.

Note that the only model that supports full function calling at this time is "functionary".
The gguf-converted files for this model can be found here: [functionary-7b-v1](https://huggingface.co/abetlen/functionary-7b-v1-GGUF)

```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="path/to/functionary/llama-model.gguf", chat_format="functionary")
>>> llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary."
        },
        {
          "role": "user",
          "content": "Extract Jason is 25 years old"
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "UserDetail",
          "parameters": {
            "type": "object",
            "title": "UserDetail",
            "properties": {
              "name": {
                "title": "Name",
                "type": "string"
              },
              "age": {
                "title": "Age",
                "type": "integer"
              }
            },
            "required": [ "name", "age" ]
          }
        }
      }],
      tool_choice={
        "type": "function",
        "function": {
          "name": "UserDetail"
        }
      }
)
```

### Multi-modal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models, which allow the language model to
read information from both text and images.

You'll first need to download one of the available multi-modal models in GGUF format:

- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)

Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.

```python
>>> from llama_cpp import Llama
>>> from llama_cpp.llama_chat_format import Llava15ChatHandler
>>> chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
>>> llm = Llama(model_path="./path/to/llava/llama-model.gguf", chat_handler=chat_handler)
>>> llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://.../image.png"}},
                {"type": "text", "text": "Describe this image in detail please."}
            ]
        }
    ]
)
```

### Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.
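
As a minimal sketch, a larger context window can be requested through the `n_ctx` constructor parameter (the value 2048 below is just an illustration):

```python
>>> llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)  # illustrative 2048-token context window
```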