ollama

Author	SHA1	Message	Date
Daniel Hiltgen	05cd82ef94	Rename gpu package discover (#7143) Cleaning up go package naming	2024-10-16 17:45:00 -07:00
Daniel Hiltgen	24636dfa87	Discovery CPU details for default thread selection (#6264) On windows, detect large multi-socket systems and reduce to the number of cores in one socket for best performance	2024-10-15 11:36:08 -07:00
Daniel Hiltgen	f3c8b898cd	Track GPU discovery failure information (#5820) * Expose GPU discovery failure information * Remove exposed API for now	2024-10-14 16:26:45 -07:00
Daniel Hiltgen	d470ebe78b	Add Jetson cuda variants for arm This adds new variants for arm64 specific to Jetson platforms	2024-08-19 09:38:53 -07:00
Michael Yang	b732beba6a	lint	2024-08-01 17:06:06 -07:00
Jeffrey Morgan	c4cf8ad559	llm: avoid loading model if system memory is too small (#5637) * llm: avoid loading model if system memory is too small * update log * Instrument swap free space On linux and windows, expose how much swap space is available so we can take that into consideration when scheduling models * use `systemSwapFreeMemory` in check --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2024-07-11 16:42:57 -07:00
Jeffrey Morgan	f8241bfba3	gpu: report system free memory instead of 0 (#5521)	2024-07-06 19:35:04 -04:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	fc37c192ae	Refine CPU load behavior with system memory visibility	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	30a7d7096c	Bump VRAM buffer back up Under stress scenarios we're seeing OOMs so this should help stabilize the allocations under heavy concurrency stress.	2024-05-10 09:15:28 -07:00
Michael Yang	4736391bfb	llm: add minimum based on layer size	2024-05-06 17:04:19 -07:00
Jeffrey Morgan	f0c454ab57	gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068)	2024-05-01 11:46:03 -04:00
Daniel Hiltgen	34b9db5afc	Request and model concurrency This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.	2024-04-22 19:29:12 -07:00
Michael Yang	26df674785	scale graph based on gpu count	2024-04-16 14:44:13 -07:00
Michael Yang	41a272de9f	darwin: no partial offloading if required memory greater than system	2024-04-16 11:22:38 -07:00
Michael Yang	7e33a017c0	partial offloading	2024-04-10 11:37:20 -07:00
Daniel Hiltgen	be330174dd	Allow setting max vram for workarounds Until we get all the memory calculations correct, this can provide and escape valve for users to workaround out of memory crashes.	2024-03-06 17:15:06 -08:00
peanut256	a189810df6	Determine max VRAM on macOS using `recommendedMaxWorkingSetSize` (#2354) * read iogpu.wired_limit_mb on macOS Fix for https://github.com/ollama/ollama/issues/1826 * improved determination of available vram on macOS read the recommended maximal vram on macOS via Metal API * Removed macOS-specific logging * Remove logging from gpu_darwin.go * release Core Foundation object fixes a possible memory leak	2024-02-25 18:16:45 -05:00
Daniel Hiltgen	7427fa1387	Fix up the CPU fallback selection The memory changes and multi-variant change had some merge glitches I missed. This fixes them so we actually get the cpu llm lib and best variant for the given system.	2024-01-11 15:27:06 -08:00
Daniel Hiltgen	39928a42e8	Always dynamically load the llm server library This switches darwin to dynamic loading, and refactors the code now that no static linking of the library is used on any platform	2024-01-11 08:42:47 -08:00
Daniel Hiltgen	d88c527be3	Build multiple CPU variants and pick the best This reduces the built-in linux version to not use any vector extensions which enables the resulting builds to run under Rosetta on MacOS in Docker. Then at runtime it checks for the actual CPU vector extensions and loads the best CPU library available	2024-01-11 08:42:47 -08:00
Jeffrey Morgan	c336693f07	calculate overhead based number of gpu devices (#1875)	2024-01-09 15:53:33 -05:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850) * select layers based on estimated model memory usage * always account for scratch vram * dont load +1 layers * better estmation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Jeffrey Morgan	c7ea8f237e	set `num_gpu` to 1 only by default on darwin arm64 (#1771)	2024-01-03 14:10:29 -05:00
Daniel Hiltgen	a2ad952440	Fix windows system memory lookup This refines the gpu package error handling and fixes a bug with the system memory lookup on windows.	2024-01-03 08:50:01 -08:00
Daniel Hiltgen	d966b730ac	Switch windows build to fully dynamic Refactor where we store build outputs, and support a fully dynamic loading model on windows so the base executable has no special dependencies thus doesn't require a special PATH.	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	7555ea44f8	Revamp the dynamic library shim This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6 given 5.7 builds don't work on the latest ROCm library that just shipped.	2023-12-20 14:45:57 -08:00
Daniel Hiltgen	6558f94ed0	Fix darwin intel build	2023-12-19 13:32:24 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00

29 commits