ollama

Author	SHA1	Message	Date
Josh	10e768826c	fix: quant err message (#5616 )	2024-07-11 17:24:29 -07:00
Daniel Hiltgen	b51e3b63ac	Statically link c++ and thread lib This makes sure we statically link the c++ and thread library on windows to avoid unnecessary runtime dependencies on non-standard DLLs	2024-07-09 11:34:30 -07:00
jmorganca	a08f20d910	release: remove unwanted mingw dll.a files	2024-07-06 15:21:15 -04:00
jmorganca	6cea036027	Revert "llm: only statically link libstdc++" This reverts commit `5796bfc401`.	2024-07-06 15:10:48 -04:00
jmorganca	5796bfc401	llm: only statically link libstdc++	2024-07-06 14:06:20 -04:00
jmorganca	f1a379aa56	llm: statically link pthread and stdc++ dependencies in windows build	2024-07-06 12:54:02 -04:00
Jeffrey Morgan	5304b765b2	llm: put back old include dir (#5507 ) * llm: put back old include dir * llm: update link paths for old submodule commits	2024-07-05 19:34:21 -04:00
Jeffrey Morgan	78fb33dd07	fix typo in cgo directives in `llm.go` (#5501 )	2024-07-05 15:18:36 -04:00
Jeffrey Morgan	8f8e736b13	update llama.cpp submodule to `d7fd29f` (#5475 )	2024-07-05 13:25:58 -04:00
Michael Yang	829ff87bd1	revert tokenize ffi (#4761 ) * Revert "use `int32_t` for call to tokenize (#4738)" This reverts commit `763bb65dbb`. * Revert "vocab only" This reverts commit `bf54c845e9`. * Revert "use ffi for tokenizing/detokenizing" This reverts commit `26a00a0410`.	2024-05-31 18:54:21 -07:00
Jeffrey Morgan	763bb65dbb	use `int32_t` for call to tokenize (#4738 ) * use `int32_t` for call to tokenize * variable naming * cleanup * fix crash	2024-05-30 21:43:30 -07:00
Michael Yang	bf54c845e9	vocab only	2024-05-30 16:49:28 -07:00
Michael Yang	26a00a0410	use ffi for tokenizing/detokenizing	2024-05-29 11:26:47 -07:00
Michael Yang	01811c176a	comments	2024-05-06 15:24:01 -07:00
Michael Yang	9685c34509	quantize any fp16/fp32 model - FROM /path/to/{safetensors,pytorch} - FROM /path/to/fp{16,32}.bin - FROM model:fp{16,32}	2024-05-06 15:24:01 -07:00
Hernan Martinez	86e67fc4a9	Add import declaration for windows,arm64 to llm.go	2024-04-26 23:23:53 -06:00
Michael Yang	9502e5661f	cgo quantize	2024-04-08 15:31:08 -07:00
Daniel Hiltgen	58d95cc9bd	Switch back to subprocessing for llama.cpp This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shutdown when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.	2024-04-01 16:48:18 -07:00
Michael Yang	91b3e4d282	update memory calcualtions count each layer independently when deciding gpu offloading	2024-04-01 13:16:32 -07:00
Michael Yang	d338d70492	refactor model parsing	2024-04-01 13:16:15 -07:00
Patrick Devine	1b272d5bcd	change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347 )	2024-03-26 13:04:17 -07:00
Jeffrey Morgan	f9cd55c70b	disable gpu for certain model architectures and fix divide-by-zero on memory estimation	2024-03-09 12:51:38 -08:00
Daniel Hiltgen	6c5ccb11f9	Revamp ROCm support This refines where we extract the LLM libraries to by adding a new OLLAMA_HOME env var, that defaults to `~/.ollama` The logic was already idempotenent, so this should speed up startups after the first time a new release is deployed. It also cleans up after itself. We now build only a single ROCm version (latest major) on both windows and linux. Given the large size of ROCms tensor files, we split the dependency out. It's bundled into the installer on windows, and a separate download on windows. The linux install script is now smart and detects the presence of AMD GPUs and looks to see if rocm v6 is already present, and if not, then downloads our dependency tar file. For Linux discovery, we now use sysfs and check each GPU against what ROCm supports so we can degrade to CPU gracefully instead of having llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows dynamic library loading logic to access the amdhip64.dll APIs to query the GPU information.	2024-03-07 10:36:50 -08:00
Daniel Hiltgen	a1dfab43b9	Ensure the libraries are present When we store our libraries in a temp dir, a reaper might clean them when we are idle, so make sure to check for them before we reload.	2024-02-07 17:27:49 -08:00
Jeffrey Morgan	4458efb73a	Load all layers on `arm64` macOS if model is small enough (#2149 )	2024-01-22 17:40:06 -08:00
Daniel Hiltgen	fedd705aea	Mechanical switch from log to slog A few obvious levels were adjusted, but generally everything mapped to "info" level.	2024-01-18 14:12:57 -08:00
Michael Yang	eaed6f8c45	add max context length check	2024-01-12 14:54:07 -08:00
Daniel Hiltgen	7427fa1387	Fix up the CPU fallback selection The memory changes and multi-variant change had some merge glitches I missed. This fixes them so we actually get the cpu llm lib and best variant for the given system.	2024-01-11 15:27:06 -08:00
Daniel Hiltgen	de2fbdec99	Merge pull request #1819 from dhiltgen/multi_variant Support multiple LLM libs; ROCm v5 and v6; Rosetta, AVX, and AVX2 compatible CPU builds	2024-01-11 14:00:48 -08:00
Michael Yang	f4f939de28	Merge pull request #1552 from jmorganca/mxyng/lint-test add lint and test on pull_request	2024-01-11 09:37:45 -08:00
Daniel Hiltgen	39928a42e8	Always dynamically load the llm server library This switches darwin to dynamic loading, and refactors the code now that no static linking of the library is used on any platform	2024-01-11 08:42:47 -08:00
Daniel Hiltgen	d88c527be3	Build multiple CPU variants and pick the best This reduces the built-in linux version to not use any vector extensions which enables the resulting builds to run under Rosetta on MacOS in Docker. Then at runtime it checks for the actual CPU vector extensions and loads the best CPU library available	2024-01-11 08:42:47 -08:00
Jeffrey Morgan	ab6be852c7	revisit memory allocation to account for full kv cache on main gpu	2024-01-11 01:45:31 -05:00
Daniel Hiltgen	8da7bef05f	Support multiple variants for a given llm lib type In some cases we may want multiple variants for a given GPU type or CPU. This adds logic to have an optional Variant which we can use to select an optimal library, but also allows us to try multiple variants in case some fail to load. This can be useful for scenarios such as ROCm v5 vs v6 incompatibility or potentially CPU features.	2024-01-10 17:27:51 -08:00
Jeffrey Morgan	b24e8d17b2	Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896 ) * increase minimum cuda overhead and fix minimum overhead for multi-gpu * fix multi gpu overhead * limit overhead to 10% of all gpus * better wording * allocate fixed amount before layers * fixed only includes graph alloc	2024-01-10 19:08:51 -05:00
Michael Yang	f921e2696e	typo	2024-01-09 09:45:42 -08:00
Jeffrey Morgan	f387e9631b	use runner if cuda alloc won't fit	2024-01-09 00:44:34 -05:00
Jeffrey Morgan	cb534e6ac2	use 10% vram overhead for cuda	2024-01-08 23:17:44 -05:00
Jeffrey Morgan	58ce2d8273	better estimate scratch buffer size	2024-01-08 21:32:44 -05:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850 ) * select layers based on estimated model memory usage * always account for scratch vram * dont load +1 layers * better estmation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Daniel Hiltgen	e9ce91e9a6	Load dynamic cpu lib on windows On linux, we link the CPU library in to the Go app and fall back to it when no GPU match is found. On windows we do not link in the CPU library so that we can better control our dependencies for the CLI. This fixes the logic so we correctly fallback to the dynamic CPU library on windows.	2024-01-04 08:41:41 -08:00
Jeffrey Morgan	c0285158a9	tweak memory requirements error text	2024-01-03 19:47:18 -05:00
Jeffrey Morgan	77a66df72c	add macOS memory check for 47B models	2024-01-03 19:46:16 -05:00
Jeffrey Morgan	5b4837f881	remove unused filetype check	2024-01-03 19:45:39 -05:00
Daniel Hiltgen	7555ea44f8	Revamp the dynamic library shim This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6 given 5.7 builds don't work on the latest ROCm library that just shipped.	2023-12-20 14:45:57 -08:00
Daniel Hiltgen	3269535a4c	Refine handling of shim presence This allows the CPU only builds to work on systems with Radeon cards	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	d4cd695759	Add cgo implementation for llama.cpp Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.	2023-12-19 09:05:46 -08:00
Bruce MacDonald	811b1f03c8	deprecate ggml - remove ggml runner - automatically pull gguf models when ggml detected - tell users to update to gguf in the case automatic pull fails Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>	2023-12-19 09:05:46 -08:00
Michael Yang	b9495ea162	load projectors	2023-12-05 14:36:12 -08:00

1 2

77 commits