ollama

Author	SHA1	Message	Date
Jeffrey Morgan	4458efb73a	Load all layers on `arm64` macOS if model is small enough (#2149 )	2024-01-22 17:40:06 -08:00
Daniel Hiltgen	fedd705aea	Mechanical switch from log to slog A few obvious levels were adjusted, but generally everything mapped to "info" level.	2024-01-18 14:12:57 -08:00
Michael Yang	eaed6f8c45	add max context length check	2024-01-12 14:54:07 -08:00
Daniel Hiltgen	7427fa1387	Fix up the CPU fallback selection The memory changes and multi-variant change had some merge glitches I missed. This fixes them so we actually get the cpu llm lib and best variant for the given system.	2024-01-11 15:27:06 -08:00
Daniel Hiltgen	de2fbdec99	Merge pull request #1819 from dhiltgen/multi_variant Support multiple LLM libs; ROCm v5 and v6; Rosetta, AVX, and AVX2 compatible CPU builds	2024-01-11 14:00:48 -08:00
Michael Yang	f4f939de28	Merge pull request #1552 from jmorganca/mxyng/lint-test add lint and test on pull_request	2024-01-11 09:37:45 -08:00
Daniel Hiltgen	39928a42e8	Always dynamically load the llm server library This switches darwin to dynamic loading, and refactors the code now that no static linking of the library is used on any platform	2024-01-11 08:42:47 -08:00
Daniel Hiltgen	d88c527be3	Build multiple CPU variants and pick the best This reduces the built-in linux version to not use any vector extensions which enables the resulting builds to run under Rosetta on MacOS in Docker. Then at runtime it checks for the actual CPU vector extensions and loads the best CPU library available	2024-01-11 08:42:47 -08:00
Jeffrey Morgan	ab6be852c7	revisit memory allocation to account for full kv cache on main gpu	2024-01-11 01:45:31 -05:00
Daniel Hiltgen	8da7bef05f	Support multiple variants for a given llm lib type In some cases we may want multiple variants for a given GPU type or CPU. This adds logic to have an optional Variant which we can use to select an optimal library, but also allows us to try multiple variants in case some fail to load. This can be useful for scenarios such as ROCm v5 vs v6 incompatibility or potentially CPU features.	2024-01-10 17:27:51 -08:00
Jeffrey Morgan	b24e8d17b2	Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896 ) * increase minimum cuda overhead and fix minimum overhead for multi-gpu * fix multi gpu overhead * limit overhead to 10% of all gpus * better wording * allocate fixed amount before layers * fixed only includes graph alloc	2024-01-10 19:08:51 -05:00
Michael Yang	f921e2696e	typo	2024-01-09 09:45:42 -08:00
Jeffrey Morgan	f387e9631b	use runner if cuda alloc won't fit	2024-01-09 00:44:34 -05:00
Jeffrey Morgan	cb534e6ac2	use 10% vram overhead for cuda	2024-01-08 23:17:44 -05:00
Jeffrey Morgan	58ce2d8273	better estimate scratch buffer size	2024-01-08 21:32:44 -05:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850 ) * select layers based on estimated model memory usage * always account for scratch vram * dont load +1 layers * better estmation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Daniel Hiltgen	e9ce91e9a6	Load dynamic cpu lib on windows On linux, we link the CPU library in to the Go app and fall back to it when no GPU match is found. On windows we do not link in the CPU library so that we can better control our dependencies for the CLI. This fixes the logic so we correctly fallback to the dynamic CPU library on windows.	2024-01-04 08:41:41 -08:00
Jeffrey Morgan	c0285158a9	tweak memory requirements error text	2024-01-03 19:47:18 -05:00
Jeffrey Morgan	77a66df72c	add macOS memory check for 47B models	2024-01-03 19:46:16 -05:00
Jeffrey Morgan	5b4837f881	remove unused filetype check	2024-01-03 19:45:39 -05:00
Daniel Hiltgen	7555ea44f8	Revamp the dynamic library shim This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6 given 5.7 builds don't work on the latest ROCm library that just shipped.	2023-12-20 14:45:57 -08:00
Daniel Hiltgen	3269535a4c	Refine handling of shim presence This allows the CPU only builds to work on systems with Radeon cards	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	d4cd695759	Add cgo implementation for llama.cpp Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.	2023-12-19 09:05:46 -08:00
Bruce MacDonald	811b1f03c8	deprecate ggml - remove ggml runner - automatically pull gguf models when ggml detected - tell users to update to gguf in the case automatic pull fails Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>	2023-12-19 09:05:46 -08:00
Michael Yang	b9495ea162	load projectors	2023-12-05 14:36:12 -08:00
Bruce MacDonald	195e3d9dbd	chat api endpoint (#1392 )	2023-12-05 14:57:33 -05:00
Jeffrey Morgan	00d06619a1	Revert "chat api (#991 )" while context variable is fixed This reverts commit `7a0899d62d`.	2023-12-04 21:16:27 -08:00
Bruce MacDonald	7a0899d62d	chat api (#991 ) - update chat docs - add messages chat endpoint - remove deprecated context and template generate parameters from docs - context and template are still supported for the time being and will continue to work as expected - add partial response to chat history	2023-12-04 18:01:06 -05:00
Michael Yang	19b7a4d715	recent llama.cpp update added kernels for fp32, q5_0, and q5_1	2023-11-20 13:44:31 -08:00
Jeffrey Morgan	5cba29b9d6	JSON mode: add `"format" as an api parameter (#1051 ) * add `"format": "json"` as an API parameter --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-11-09 16:44:02 -08:00
Jeffrey Morgan	2e53704685	default rope params to 0 for new models (#968 )	2023-11-02 08:41:30 -07:00
Jeffrey Morgan	7ed5a39bc7	simpler check for model loading compatibility errors	2023-10-19 14:50:49 -04:00
Jeffrey Morgan	a7dad24d92	add error for `falcon` and `starcoder` vocab compatibility (#844 ) add error for falcon and starcoder vocab compatibility --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-10-19 12:18:31 -04:00
Michael Yang	36fe2deebf	only check system memory on macos	2023-10-13 14:47:29 -07:00
Michael Yang	4a8931f634	check total (system + video) memory	2023-10-13 14:47:29 -07:00
Michael Yang	bd6e38fb1a	refactor memory check	2023-10-13 14:47:29 -07:00
Michael Yang	92189a5855	fix memory check	2023-10-13 14:47:29 -07:00
Michael Yang	b599946b74	add format bytes	2023-10-11 14:08:23 -07:00
Bruce MacDonald	d06bc0cb6e	enable q8, q5, 5_1, and f32 for linux gpu (#699 )	2023-10-05 12:53:47 -04:00
Bruce MacDonald	86279f4ae3	unbound max num gpu layers (#591 ) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-25 18:36:46 -04:00
Bruce MacDonald	4cba75efc5	remove tmp directories created by previous servers (#559 ) * remove tmp directories created by previous servers * clean up on server stop * Update routes.go * Update server/routes.go Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * create top-level temp ollama dir * check file exists before creating --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-21 20:38:49 +01:00
Michael Yang	7dee25a07f	fix falcon decode get model and file type from bin file	2023-09-12 12:34:53 -07:00
Bruce MacDonald	09dd2aeff9	GGUF support (#441 )	2023-09-07 13:55:37 -04:00
Bruce MacDonald	42998d797d	subprocess llama.cpp server (#401 ) * remove c code * pack llama.cpp * use request context for llama_cpp * let llama_cpp decide the number of threads to use * stop llama runner when app stops * remove sample count and duration metrics * use go generate to get libraries * tmp dir for running llm	2023-08-30 16:35:03 -04:00
Michael Yang	b25dd1795d	allow F16 to use metal warning F16 uses significantly more memory than quantized model so the standard requires don't apply.	2023-08-26 08:38:48 -07:00
Michael Yang	304f2b6c96	add 34b to mem check	2023-08-26 08:29:21 -07:00
Michael Yang	a894cc792d	model and file type as strings	2023-08-17 12:08:04 -07:00
Michael Yang	e26085b921	close open files	2023-08-14 16:08:06 -07:00
Michael Yang	6de5d032e1	implement loading ggml lora adapters through the modelfile	2023-08-10 09:23:39 -07:00

1 2

53 commits