ollama

Author	SHA1	Message	Date
Daniel Hiltgen	1b991d0ba9	Refine build to support CPU only If someone checks out the ollama repo and doesn't install the CUDA library, this will ensure they can build a CPU only version	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	9adca7f711	Bump llama.cpp to b1662 and set n_parallel=1	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	89bbaafa64	Build linux using ubuntu 20.04 This changes the container-based linux build to use an older Ubuntu distro to improve our compatibility matrix for older user machines	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00
65a	f8ef4439e9	Use build tags to generate accelerated binaries for CUDA and ROCm on Linux. The build tags rocm or cuda must be specified to both go generate and go build. ROCm builds should have both ROCM_PATH set (and the ROCM SDK present) as well as CLBlast installed (for GGML) and CLBlast_DIR set in the environment to the CLBlast cmake directory (likely /usr/lib/cmake/CLBlast). Build tags are also used to switch VRAM detection between cuda and rocm implementations, using added "accelerator_foo.go" files which contain architecture specific functions and variables. accelerator_none is used when no tags are set, and a helper function addRunner will ignore it if it is the chosen accelerator. Fix go generate commands, thanks @deadmeu for testing.	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	d4cd695759	Add cgo implementation for llama.cpp Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.	2023-12-19 09:05:46 -08:00
Bruce MacDonald	811b1f03c8	deprecate ggml - remove ggml runner - automatically pull gguf models when ggml detected - tell users to update to gguf in the case automatic pull fails Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>	2023-12-19 09:05:46 -08:00
Jeffrey Morgan	6b5bdfa6c9	update runner submodule	2023-12-18 17:33:46 -05:00
Jeffrey Morgan	c063ee4af0	update runner submodule to fix hipblas build	2023-12-18 15:41:13 -05:00
Jeffrey Morgan	b85982eb91	update runner submodule	2023-12-18 12:43:31 -05:00
Bruce MacDonald	6ee8c80199	restore model load duration on generate response (#1524) * restore model load duration on generate response - set model load duration on generate and chat done response - calculate createAt time when response created * remove checkpoints predict opts * Update routes.go	2023-12-14 12:15:50 -05:00
Jeffrey Morgan	31f0551dab	Update runner to support mixtral and mixture of experts (MoE) (#1475)	2023-12-13 17:15:10 -05:00
Michael Yang	4251b342de	Merge pull request #1469 from jmorganca/mxyng/model-types remove per-model types	2023-12-12 12:27:03 -08:00
Bruce MacDonald	3144e2a439	exponential back-off (#1484)	2023-12-12 12:33:02 -05:00
Bruce MacDonald	c0960e29b5	retry on concurrent request failure (#1483) - remove parallel	2023-12-12 12:14:35 -05:00
Patrick Devine	910e9401d0	Multimodal support (#1216) --------- Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>	2023-12-11 13:56:22 -08:00
Michael Yang	56ffc3023a	remove per-model types mostly replaced by decoding tensors except ggml models which only support llama	2023-12-11 09:40:21 -08:00
Jeffrey Morgan	fa2f095bd9	fix model name returned by `/api/generate` being different than the model name provided	2023-12-10 11:42:15 -05:00
Jeffrey Morgan	d9a250e9b5	seek to end of file when decoding older model formats	2023-12-09 21:14:35 -05:00
Jeffrey Morgan	944519ed16	seek to eof for older model binaries	2023-12-09 20:48:57 -05:00
Jeffrey Morgan	2dd040d04c	do not use `--parallel 2` for old runners	2023-12-09 20:17:33 -05:00
Bruce MacDonald	bbe41ce41a	fix: parallel queueing race condition caused silent failure (#1445) * fix: queued request failures - increase parallel requests to 2 to complete queued request, queueing is managed in ollama * log steam errors	2023-12-09 14:14:02 -05:00
Michael Yang	f1b049fed8	Merge pull request #1377 from jmorganca/mxyng/qwen update for qwen	2023-12-06 12:31:51 -08:00
Michael Yang	b9495ea162	load projectors	2023-12-05 14:36:12 -08:00
Michael Yang	409bb9674e	Merge pull request #1308 from jmorganca/mxyng/split-from split from into one or more models	2023-12-05 14:33:03 -08:00
Michael Yang	d3479c07a1	Merge pull request #1250 from jmorganca/mxyng/create-layer refactor layer creation	2023-12-05 14:32:52 -08:00
Bruce MacDonald	195e3d9dbd	chat api endpoint (#1392)	2023-12-05 14:57:33 -05:00
Jeffrey Morgan	00d06619a1	Revert "chat api (#991)" while context variable is fixed This reverts commit `7a0899d62d`.	2023-12-04 21:16:27 -08:00
Michael Yang	5a5dca13b2	comments	2023-12-04 16:59:23 -08:00
Michael Yang	72e7a49aa9	seek instead of copyn	2023-12-04 16:59:23 -08:00
Michael Yang	2cb0fa7d40	split from into one or more models	2023-12-04 16:59:23 -08:00
Michael Yang	b2816bca67	unnecessary ReadSeeker for DecodeGGML	2023-12-04 16:59:23 -08:00
Bruce MacDonald	7a0899d62d	chat api (#991) - update chat docs - add messages chat endpoint - remove deprecated context and template generate parameters from docs - context and template are still supported for the time being and will continue to work as expected - add partial response to chat history	2023-12-04 18:01:06 -05:00
Michael Yang	6deebf2489	update for qwen	2023-12-04 11:38:05 -08:00
Jeffrey Morgan	16a9006306	add back `f16c` instructions on intel mac	2023-11-26 15:59:49 -05:00
Jeffrey Morgan	9e4a316405	update submodule commit	2023-11-26 14:52:00 -05:00
Jing Zhang	82b9b329ff	windows CUDA support (#1262) * Support cuda build in Windows * Enable dynamic NumGPU allocation for Windows	2023-11-24 17:16:36 -05:00
Jongwook Choi	12e8c12d2b	Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261) When CUDA peer access is enabled, multi-gpu inference will produce garbage output. This is a known bug of llama.cpp (or nvidia). Until the upstream bug is fixed, we can disable CUDA peer access temporarily to ensure correct output. See #961.	2023-11-24 14:05:57 -05:00
Jeffrey Morgan	d77dde126b	consistent cpu instructions on macos and linux	2023-11-22 16:26:46 -05:00
Michael Yang	199941cd15	fix: gguf int type	2023-11-22 11:40:30 -08:00
Michael Yang	a00fac4ec8	update llama.cpp	2023-11-21 09:50:02 -08:00
Jeffrey Morgan	a3fcecf943	only set `main_gpu` if value > 0 is provided	2023-11-20 19:54:04 -05:00
Michael Yang	19b7a4d715	recent llama.cpp update added kernels for fp32, q5_0, and q5_1	2023-11-20 13:44:31 -08:00
Purinda Gunasekara	be61a81758	main-gpu argument is not getting passed to llamacpp, fixed. (#1192)	2023-11-20 10:52:52 -05:00
Jeffrey Morgan	13ba6df5ab	enable cpu instructions on intel macs	2023-11-19 23:20:26 -05:00
Jeffrey Morgan	36a3bbf65f	Update llm/llama.go	2023-11-18 21:25:07 -05:00
Bruce MacDonald	43a726149d	fix potentially inaccurate error message	2023-11-18 21:25:07 -05:00
Jeffrey Morgan	41434a7cdc	build intel mac with correct binary and compile flags	2023-11-16 22:14:51 -05:00
Jeffrey Morgan	5cba29b9d6	JSON mode: add `"format"` as an api parameter (#1051) * add `"format": "json"` as an API parameter --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-11-09 16:44:02 -08:00
Bruce MacDonald	1ae84bc2a2	skip gpu if less than 2GB VRAM are available (#1059)	2023-11-09 13:16:16 -08:00
Michael Yang	c5e1bbabda	instead of static number of parameters for each model family, get the real number from the tensors (#1022) * parse tensor info * refactor decoder * return actual parameter count * explicit rounding * s/Human/HumanNumber/	2023-11-08 17:55:46 -08:00
Jeffrey Morgan	c44b619428	remove unused `fmt.Println`	2023-11-03 17:24:58 -07:00
Jeffrey Morgan	17678b7225	Restore system prompt on requests and default `num_keep` to `0`	2023-11-03 13:25:25 -07:00
Jeffrey Morgan	2e53704685	default rope params to 0 for new models (#968)	2023-11-02 08:41:30 -07:00
Michael Yang	642128b75a	append LD_LIBRARY_PATH	2023-10-31 15:54:49 -07:00
Jeffrey Morgan	3a1ed9ff70	restore building runner with `AVX` on by default (#900)	2023-10-27 12:13:44 -07:00
Bruce MacDonald	6d283882b1	catch insufficient permissions nvidia err (#934)	2023-10-27 12:42:40 -04:00
Bruce MacDonald	2665f3c28e	offload 75% of available vram to improve stability (#921)	2023-10-26 20:49:55 -04:00
Jeffrey Morgan	b0c9cd0f3b	fix metal assertion errors	2023-10-24 00:32:36 -07:00
Jeffrey Morgan	77f61c6301	update submodule commit	2023-10-24 00:30:27 -07:00
Jeffrey Morgan	f3604534e5	update submodule commit	2023-10-23 23:59:12 -07:00
Michael Yang	0c7a00a264	bump submodules pin to 9e70cc03229df19ca2d28ce23cc817198f897278 for now since 438c2ca83045a00ef244093d27e9ed41a8cb4ea9 is breaking	2023-10-23 11:17:59 -07:00
Michael Yang	36c160f1c3	Merge pull request #881 from jmorganca/mxyng/ggufv3 ggufv3	2023-10-23 10:50:45 -07:00
Michael Yang	c9167494cb	update default log target	2023-10-23 10:44:50 -07:00
Michael Yang	125d0a013a	ggufv3 ggufv3 adds support for big endianness, mainly for s390x architecture. while that's not currently supported for ollama, the change is simple. loosen version check to be more forward compatible. unless specified, gguf versions other v1 will be decoded into v2.	2023-10-23 09:35:49 -07:00
Jeffrey Morgan	7ed5a39bc7	simpler check for model loading compatibility errors	2023-10-19 14:50:49 -04:00
Jeffrey Morgan	a7dad24d92	add error for `falcon` and `starcoder` vocab compatibility (#844) add error for falcon and starcoder vocab compatibility --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-10-19 12:18:31 -04:00
Michael Yang	235e43d7f6	Merge pull request #833 from discovertomorrow/leadingspace Fix Issue with Leading Whitespaces in Decoded Context	2023-10-18 13:52:48 -07:00
Arne Müller	730996e530	use TrimPrefix instead of TrimLeft	2023-10-18 22:51:30 +02:00
Arne Müller	ce6197a8e0	removed redundant strings.CutPrefix from Decode	2023-10-18 22:47:20 +02:00
Arne Müller	46b9953f32	use strings.TrimLeft to remove spaces	2023-10-18 22:41:19 +02:00
Bruce MacDonald	565648f3f7	relay CUDA errors to the client (#825)	2023-10-18 15:36:56 -04:00
Arne Müller	90c49bed57	moved removal of leading space into Predict	2023-10-18 20:08:26 +02:00
Arne Müller	5dc0cff459	fix whitespace removal	2023-10-18 08:15:27 +02:00
Michael Yang	08b0e04f40	Merge pull request #813 from jmorganca/mxyng/llama refactor llm/llama.go	2023-10-17 14:05:58 -07:00
Michael Yang	b36b0b71f8	use cut prefix	2023-10-17 14:01:39 -07:00
Michael Yang	094df37563	remove unused struct	2023-10-17 14:01:38 -07:00
Bruce MacDonald	f3648fd206	Update llama.cpp gguf to latest (#710)	2023-10-17 16:55:16 -04:00
Bruce MacDonald	bd93a94abd	fix MB VRAM log output (#824)	2023-10-17 15:35:16 -04:00
Michael Yang	f55bdb6f10	Merge pull request #799 from deichbewohner/jsonmarshaling Fix JSON Marshal Escaping for Special Characters	2023-10-17 08:46:02 -07:00
Michael Yang	2870a9bfc8	Merge pull request #812 from jmorganca/mxyng/fix-format-string fix: wrong format string type	2023-10-17 08:40:49 -07:00
Arne Müller	8fa3f366ad	Removed newline trimming and used buffer directly in POST request.	2023-10-17 08:17:35 +02:00
Michael Yang	fddb303f23	fix: format string wrong type	2023-10-16 16:14:28 -07:00
Michael Yang	cb4a80b693	fix: regression unsupported metal types omitting `--n-gpu-layers` means use metal on macos which isn't correct since ollama uses `num_gpu=0` to explicitly disable gpu for file types that are not implemented in metal	2023-10-16 14:37:20 -07:00
Arne Müller	ee94693b1a	handling unescaped json marshaling	2023-10-16 11:15:55 +02:00
Michael Yang	11d82d7b9b	update checkvram	2023-10-13 14:47:29 -07:00
Michael Yang	36fe2deebf	only check system memory on macos	2023-10-13 14:47:29 -07:00
Michael Yang	4a8931f634	check total (system + video) memory	2023-10-13 14:47:29 -07:00
Michael Yang	bd6e38fb1a	refactor memory check	2023-10-13 14:47:29 -07:00
Michael Yang	92189a5855	fix memory check	2023-10-13 14:47:29 -07:00
Michael Yang	d790bf9916	Merge pull request #783 from jmorganca/mxyng/fix-gpu-offloading fix: offloading on low end GPUs	2023-10-13 14:36:44 -07:00
Michael Yang	35afac099a	do not use gpu binary when num_gpu == 0	2023-10-13 14:32:12 -07:00
Michael Yang	811c3d1900	no gpu if vram < 2GB	2023-10-13 14:32:12 -07:00
Bruce MacDonald	6fe178134d	improve api error handling (#781) - remove new lines from llama.cpp error messages relayed to client - check api option types and return error on wrong type - change num layers from 95% VRAM to 92% VRAM	2023-10-13 16:57:10 -04:00
Bruce MacDonald	56497663c8	relay model runner error message to client (#720) * give direction to user when runner fails * also relay errors from timeout * increase timeout to 3 minutes	2023-10-12 11:16:37 -04:00
Michael Yang	b599946b74	add format bytes	2023-10-11 14:08:23 -07:00
Bruce MacDonald	77295f716e	prevent waiting on exited command (#752) * prevent waiting on exited command * close llama runner once	2023-10-11 12:32:13 -04:00
Bruce MacDonald	f2ba1311aa	improve vram safety with 5% vram memory buffer (#724) * check free memory not total * wait for subprocess to exit	2023-10-10 16:16:09 -04:00
Jeffrey Morgan	ab0668293c	llm: fix build on `amd64`	2023-10-06 14:39:54 -07:00
Bruce MacDonald	5d22319a2c	rename server subprocess (#700) - this makes it easier to see that the subprocess is associated with ollama	2023-10-06 10:15:42 -04:00

208 commits