From aae56abb7cc96b8495a1c761a08b92cfd136d9d2 Mon Sep 17 00:00:00 2001 From: Daniel Hiltgen Date: Fri, 28 Jun 2024 13:15:57 -0700 Subject: [PATCH] Document concurrent behavior and settings --- docs/faq.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/faq.md b/docs/faq.md index b50a3138..841f1d13 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -257,3 +257,17 @@ If you wish to override the `OLLAMA_KEEP_ALIVE` setting, use the `keep_alive` AP ## How do I manage the maximum number of requests the Ollama server can queue? If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queue by setting `OLLAMA_MAX_QUEUE`. + +## How does Ollama handle concurrent requests? + +Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing. + +If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads. + +Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation. + +The following server settings may be used to adjust how Ollama handles concurrent requests: + +- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference. +- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory. +- `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512