Merge pull request #5442 from dhiltgen/concurrency_docs

Add windows radeon concurrency note
2024-07-02 12:47:47 -07:00
commit d2f19024d0
parent 996bb1b85e 69c04eecc4
1 changed file with 3 additions and 1 deletion
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -266,8 +266,10 @@ If there is insufficient available memory to load a new model request while one

 Parallel request processing for a given model results in increasing the context size by the number of parallel requests.  For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.

-The following server settings may be used to adjust how Ollama handles concurrent requests:
+The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:

 - `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory.  The default is 3 * the number of GPUs or 3 for CPU inference.
 - `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time.  The default will auto-select either 4 or 1 based on available memory.
 - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
+
+Note: Windows with Radeon GPUs currently defaults to a maximum of 1 loaded model due to limitations in how ROCm v5.7 reports available VRAM.  Once ROCm v6 is available, Windows Radeon will follow the defaults above.  You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPU's VRAM.
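
The arithmetic described in this hunk can be sketched as follows. This is only an illustration of the documented defaults, not Ollama's actual scheduling code: the function names `effective_context` and `default_max_loaded_models`, and the `windows_radeon` flag, are hypothetical and introduced here for clarity.

```python
def effective_context(per_request_ctx: int, num_parallel: int) -> int:
    """Total context allocated for a model when requests run in parallel.

    Per the FAQ text, the context size grows by the number of parallel
    requests (e.g. a 2K context with 4 parallel requests becomes 8K).
    """
    return per_request_ctx * num_parallel


def default_max_loaded_models(num_gpus: int, windows_radeon: bool = False) -> int:
    """Documented default for OLLAMA_MAX_LOADED_MODELS (hypothetical helper).

    - Windows with Radeon GPUs: 1 model (ROCm v5.7 VRAM reporting limitation).
    - Otherwise: 3 * number of GPUs, or 3 for CPU inference.
    """
    if windows_radeon:
        return 1
    return 3 * num_gpus if num_gpus > 0 else 3


if __name__ == "__main__":
    print(effective_context(2048, 4))
    print(default_max_loaded_models(num_gpus=2))
    print(default_max_loaded_models(num_gpus=1, windows_radeon=True))
```

The sketch only mirrors the defaults stated in the FAQ; the real server may also cap these values based on available memory, as the descriptions of `OLLAMA_MAX_LOADED_MODELS` and `OLLAMA_NUM_PARALLEL` note.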