17b7186cd7
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same environment variables as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on GPUs with small VRAM, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on the available GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
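A minimal sketch of the fitting idea described above, assuming a simple linear KV-cache cost model: the function names, the candidate parallelism levels, and the memory-estimate formula here are illustrative, not the actual implementation. The point is that since the effective context grows with num_ctx * parallel, the default parallelism can be walked downward until the estimated footprint fits in free VRAM.

```go
package main

import "fmt"

// estimateVRAM is a hypothetical cost model: model weights plus a KV cache
// that grows linearly with the effective context (numCtx * parallel).
func estimateVRAM(modelBytes, kvBytesPerCtxToken uint64, numCtx, parallel int) uint64 {
	return modelBytes + kvBytesPerCtxToken*uint64(numCtx*parallel)
}

// pickDefaultParallel walks candidate parallelism levels from highest to
// lowest and returns the first one whose estimated footprint fits in the
// available VRAM, falling back to 1 so the model can still load.
func pickDefaultParallel(freeVRAM, modelBytes, kvBytesPerCtxToken uint64, numCtx int) int {
	for _, parallel := range []int{4, 2, 1} { // illustrative candidates
		if estimateVRAM(modelBytes, kvBytesPerCtxToken, numCtx, parallel) <= freeVRAM {
			return parallel
		}
	}
	return 1
}

func main() {
	// Example: 8 GiB free, a 5 GiB model, ~512 KiB of KV cache per context
	// token, and a 2048-token context per request. Parallel 4 would need
	// 9 GiB, so the default drops to 2 (7 GiB), which fits.
	parallel := pickDefaultParallel(8<<30, 5<<30, 512<<10, 2048)
	fmt.Println("default parallel:", parallel)
}
```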
ext_server
generate
llama.cpp@7c26775adb
patches
filetype.go
ggla.go
ggml.go
gguf.go
llm.go
llm_darwin_amd64.go
llm_darwin_arm64.go
llm_linux.go
llm_windows.go
memory.go
memory_test.go
payload.go
server.go
status.go