In some cases we may want multiple variants for a given GPU type or CPU.
This adds logic to have an optional Variant which we can use to select
an optimal library, but also allows us to try multiple variants in case
some fail to load.
This can be useful for scenarios such as ROCm v5 vs v6 incompatibility
or potentially CPU features.
* increase minimum cuda overhead and fix minimum overhead for multi-gpu
* fix multi gpu overhead
* limit overhead to 10% of all gpus
* better wording
* allocate fixed amount before layers
* fixed only includes graph alloc
When there are multiple management libraries installed on a system
not every one will be compatible with the current driver. This change
improves our management library algorithm to build up a set of discovered
libraries based on glob patterns, and then try all of them until we're able to
load one without error.
If you attempt to run the current CUDA build on compute capability 5.2
cards, you'll hit the following failure:
cuBLAS error 15 at ggml-cuda.cu:7956: the requested functionality is not supported
* select layers based on estimated model memory usage
* always account for scratch vram
* dont load +1 layers
* better estmation for graph alloc
* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
* add overhead for cuda memory
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* fix build error on linux
* address comments
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
If we try to load the CUDA library on an old GPU, it panics and crashes
the server. This checks the compute capability before we load the
library so we can gracefully fall back to CPU mode.
Refactor where we store build outputs, and support a fully dynamic loading
model on windows so the base executable has no special dependencies thus
doesn't require a special PATH.
This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.
This also bumps the ROCm library to version 6 given 5.7 builds don't work
on the latest ROCm library that just shipped.