ollama

Author	SHA1	Message	Date
Daniel Hiltgen	4072b5879b	Merge pull request #2246 from dhiltgen/reject_cuda_without_avx Don't disable GPUs on arm without AVX	2024-01-28 16:26:55 -08:00
Daniel Hiltgen	15562e887d	Don't disable GPUs on arm without AVX AVX is an x86 feature, so ARM should be excluded from the check.	2024-01-28 15:22:38 -08:00
Daniel Hiltgen	f07f8b7a9e	Harden for zero detected GPUs At least with the ROCm libraries, it's possible to have the library present with zero GPUs. This fix avoids a divide-by-zero bug in llm.go when we try to calculate GPU memory with zero GPUs.	2024-01-28 13:13:10 -08:00
Daniel Hiltgen	e02ecfb6c8	Merge pull request #2116 from dhiltgen/cc_50_80 Add support for CUDA 5.0 cards	2024-01-27 10:28:38 -08:00
Daniel Hiltgen	667a2ba18a	Detect lack of AVX and fallback to CPU mode We build the GPU libraries with AVX enabled to ensure that if not all layers fit on the GPU we get better performance in a mixed mode. If the user is using a virtualization/emulation system that lacks AVX this used to result in an illegal instruction error and crash before this fix. Now we will report a warning in the server log, and just use CPU mode to ensure we don't crash.	2024-01-26 11:36:03 -08:00
Daniel Hiltgen	9d7b5d6c91	Ignore AMD integrated GPUs Detect and ignore integrated GPUs reported by rocm.	2024-01-26 09:21:35 -08:00
Daniel Hiltgen	013fd07139	More logging for gpu management Fix an ordering glitch of dlerr/dlclose and add more logging to help root cause some crashes users are hitting. This also refines the function pointer names to use the underlying function names instead of simplified names for readability.	2024-01-24 10:32:36 -08:00
Daniel Hiltgen	987c16b2f7	Report more information about GPUs in verbose mode This adds additional calls to both CUDA and ROCm management libraries to discover additional attributes about the GPU(s) detected in the system, and wires up runtime verbosity selection. When users hit problems with GPUs we can ask them to run with `OLLAMA_DEBUG=1 ollama serve` and share the results.	2024-01-23 11:37:02 -08:00
Daniel Hiltgen	a447a083f2	Add compute capability 5.0, 7.5, and 8.0	2024-01-20 14:24:05 -08:00
Jeffrey Morgan	f32ea81b21	increase minimum overhead to 1024MiB (#2114)	2024-01-20 17:11:38 -05:00
Daniel Hiltgen	681a914990	Add support for CUDA 5.2 cards	2024-01-20 10:48:43 -08:00
Daniel Hiltgen	552db98bf1	More WSL paths	2024-01-19 13:23:29 -08:00
Self Denial	eb76f3e379	Fix CPU-only build under Android Termux environment. Update gpu.go initGPUHandles() to declare gpuHandles variable before reading it. This resolves an "invalid memory address or nil pointer dereference" error. Update dyn_ext_server.c to avoid setting the RTLD_DEEPBIND flag under __TERMUX__ (Android).	2024-01-18 17:25:33 -07:00
Daniel Hiltgen	abec7f06e5	Merge pull request #2056 from dhiltgen/slog Mechanical switch from log to slog	2024-01-18 14:27:24 -08:00
Daniel Hiltgen	fedd705aea	Mechanical switch from log to slog A few obvious levels were adjusted, but generally everything mapped to "info" level.	2024-01-18 14:12:57 -08:00
Alexander F. Rødseth	f4bf1d514f	Let gpu.go and gen_linux.sh also find CUDA on Arch Linux	2024-01-14 13:40:36 +01:00
Daniel Hiltgen	d88c527be3	Build multiple CPU variants and pick the best This reduces the built-in linux version to not use any vector extensions, which enables the resulting builds to run under Rosetta on macOS in Docker. Then at runtime it checks for the actual CPU vector extensions and loads the best CPU library available.	2024-01-11 08:42:47 -08:00
Daniel Hiltgen	8da7bef05f	Support multiple variants for a given llm lib type In some cases we may want multiple variants for a given GPU type or CPU. This adds logic to have an optional Variant which we can use to select an optimal library, but also allows us to try multiple variants in case some fail to load. This can be useful for scenarios such as ROCm v5 vs v6 incompatibility or potentially CPU features.	2024-01-10 17:27:51 -08:00
Jeffrey Morgan	b24e8d17b2	Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896) * increase minimum cuda overhead and fix minimum overhead for multi-gpu * fix multi gpu overhead * limit overhead to 10% of all gpus * better wording * allocate fixed amount before layers * fixed only includes graph alloc	2024-01-10 19:08:51 -05:00
Daniel Hiltgen	3c49c3ab0d	Harden GPU mgmt library lookup When there are multiple management libraries installed on a system not every one will be compatible with the current driver. This change improves our management library algorithm to build up a set of discovered libraries based on glob patterns, and then try all of them until we're able to load one without error.	2024-01-10 15:06:41 -08:00
Jeffrey Morgan	c336693f07	calculate overhead based on number of gpu devices (#1875)	2024-01-09 15:53:33 -05:00
Daniel Hiltgen	1961a81f03	Set correct CUDA minimum compute capability version If you attempt to run the current CUDA build on compute capability 5.2 cards, you'll hit the following failure: cuBLAS error 15 at ggml-cuda.cu:7956: the requested functionality is not supported	2024-01-09 11:28:24 -08:00
Jeffrey Morgan	6df83e6daa	update rough cuda overhead estimate to 15% + 384MiB	2024-01-09 13:51:08 -05:00
Jeffrey Morgan	6164f378f2	revert cuda overhead to 20%	2024-01-09 00:54:29 -05:00
Jeffrey Morgan	6566387ae3	add `TODO` for cuda overhead	2024-01-09 00:28:03 -05:00
Jeffrey Morgan	37708931fb	update cuda overhead to 20% to fix crashes when switching between models and large context sizes	2024-01-09 00:05:23 -05:00
Jeffrey Morgan	f6cb0a553c	update cuda overhead to 15% or 400MiB	2024-01-08 23:45:45 -05:00
Jeffrey Morgan	2680078c13	fix build on linux	2024-01-08 23:44:13 -05:00
Jeffrey Morgan	f1b7e5f560	update overhead to 15%	2024-01-08 23:37:45 -05:00
Jeffrey Morgan	cb534e6ac2	use 10% vram overhead for cuda	2024-01-08 23:17:44 -05:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850) * select layers based on estimated model memory usage * always account for scratch vram * don't load +1 layers * better estimation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Daniel Hiltgen	d74ce6bd4f	Detect very old CUDA GPUs and fall back to CPU If we try to load the CUDA library on an old GPU, it panics and crashes the server. This checks the compute capability before we load the library so we can gracefully fall back to CPU mode.	2024-01-06 21:40:29 -08:00
Daniel Hiltgen	a2ad952440	Fix windows system memory lookup This refines the gpu package error handling and fixes a bug with the system memory lookup on windows.	2024-01-03 08:50:01 -08:00
Daniel Hiltgen	d966b730ac	Switch windows build to fully dynamic Refactor where we store build outputs, and support a fully dynamic loading model on windows so the base executable has no special dependencies and thus doesn't require a special PATH.	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	7555ea44f8	Revamp the dynamic library shim This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6 given 5.7 builds don't work on the latest ROCm library that just shipped.	2023-12-20 14:45:57 -08:00
Daniel Hiltgen	1b991d0ba9	Refine build to support CPU only If someone checks out the ollama repo and doesn't install the CUDA library, this will ensure they can build a CPU only version	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00

37 commits