documentation for stopping a model (#6766)
parent bf7ee0f4d4 · commit 5804cf1723
3 changed files with 105 additions and 4 deletions
README.md (12 changes)

@ -197,6 +197,18 @@ ollama show llama3.1

ollama list
```

### List which models are currently loaded

```
ollama ps
```
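
For illustration, `ollama ps` prints one row per loaded model with its size, whether it sits on CPU or GPU, and when it will be unloaded; the values below are made up:

```
NAME             ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest  a80c4f17acd5    4.7 GB    100% GPU     4 minutes from now
```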

### Stop a model which is currently running

```
ollama stop llama3.1
```
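
The same unload can be requested over the HTTP API by setting `keep_alive` to `0` (see the docs/api.md changes in this commit):

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "keep_alive": 0
}'
```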

### Start Ollama

`ollama serve` is used when you want to start ollama without running the desktop application.
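
A minimal sketch of that workflow, assuming the default bind address `127.0.0.1:11434`:

```shell
# terminal 1: start the server in the foreground
ollama serve

# terminal 2: exercise it via the CLI or the HTTP API
ollama run llama3.1
curl http://localhost:11434/api/version
```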
docs/api.md (85 changes)

@ -407,6 +407,33 @@ A single JSON object is returned:

}
```

#### Unload a model

If an empty prompt is provided and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.

##### Request

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "keep_alive": 0
}'
```

##### Response

A single JSON object is returned:

```json
{
  "model": "llama3.1",
  "created_at": "2024-09-12T03:54:03.516566Z",
  "response": "",
  "done": true,
  "done_reason": "unload"
}
```

## Generate a chat completion

```shell

@ -736,6 +763,64 @@ curl http://localhost:11434/api/chat -d '{

}
```

#### Load a model

If the messages array is empty, the model will be loaded into memory.

##### Request

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": []
}'
```

##### Response

```json
{
  "model": "llama3.1",
  "created_at": "2024-09-12T21:17:29.110811Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done_reason": "load",
  "done": true
}
```
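
To keep a preloaded model resident longer than the default five minutes, the same empty-messages request can carry a `keep_alive` value; the 30-minute duration below is an arbitrary example:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [],
  "keep_alive": "30m"
}'
```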

#### Unload a model

If the messages array is empty and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.

##### Request

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [],
  "keep_alive": 0
}'
```

##### Response

A single JSON object is returned:

```json
{
  "model": "llama3.1",
  "created_at": "2024-09-12T21:33:17.547535Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done_reason": "unload",
  "done": true
}
```

## Create a Model

```shell
docs/faq.md (12 changes)

@ -237,9 +237,13 @@ ollama run llama3.1 ""

## How do I keep a model loaded in memory or make it unload immediately?

By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the `ollama stop` command:

```shell
ollama stop llama3.1
```

If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to any of the following (an example request follows the list):
* a duration string (such as "10m" or "24h")
* a number in seconds (such as 3600)
* any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
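
For example, this request (model name and duration are arbitrary) keeps `llama3.1` in memory for 24 hours after each call:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": "24h"}'
```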

@ -255,9 +259,9 @@ To unload the model and free up memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": 0}'
```

Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable accepts the same values as the `keep_alive` parameter described above. Refer to the section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
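
As an illustration for a manually started server (managed installs set environment variables differently, per the linked section):

```shell
OLLAMA_KEEP_ALIVE=24h ollama serve
```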

The `keep_alive` parameter on the `/api/generate` and `/api/chat` endpoints will override the `OLLAMA_KEEP_ALIVE` setting.
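
For instance, with the 24-hour server setting above, this request (a sketch) would still unload `llama3.1` five minutes after the call:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": "5m"}'
```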
## How do I manage the maximum number of requests the Ollama server can queue?