ollama

Author	SHA1	Message	Date
Michael Yang	2870a9bfc8	Merge pull request #812 from jmorganca/mxyng/fix-format-string fix: wrong format string type	2023-10-17 08:40:49 -07:00
Arne Müller	8fa3f366ad	Removed newline trimming and used buffer directly in POST request.	2023-10-17 08:17:35 +02:00
Michael Yang	fddb303f23	fix: format string wrong type	2023-10-16 16:14:28 -07:00
Michael Yang	cb4a80b693	fix: regression unsupported metal types omitting `--n-gpu-layers` means use metal on macos which isn't correct since ollama uses `num_gpu=0` to explicitly disable gpu for file types that are not implemented in metal	2023-10-16 14:37:20 -07:00
Arne Müller	ee94693b1a	handling unescaped json marshaling	2023-10-16 11:15:55 +02:00
Michael Yang	11d82d7b9b	update checkvram	2023-10-13 14:47:29 -07:00
Michael Yang	92189a5855	fix memory check	2023-10-13 14:47:29 -07:00
Michael Yang	d790bf9916	Merge pull request #783 from jmorganca/mxyng/fix-gpu-offloading fix: offloading on low end GPUs	2023-10-13 14:36:44 -07:00
Michael Yang	35afac099a	do not use gpu binary when num_gpu == 0	2023-10-13 14:32:12 -07:00
Michael Yang	811c3d1900	no gpu if vram < 2GB	2023-10-13 14:32:12 -07:00
Bruce MacDonald	6fe178134d	improve api error handling (#781 ) - remove new lines from llama.cpp error messages relayed to client - check api option types and return error on wrong type - change num layers from 95% VRAM to 92% VRAM	2023-10-13 16:57:10 -04:00
Bruce MacDonald	56497663c8	relay model runner error message to client (#720 ) * give direction to user when runner fails * also relay errors from timeout * increase timeout to 3 minutes	2023-10-12 11:16:37 -04:00
Michael Yang	b599946b74	add format bytes	2023-10-11 14:08:23 -07:00
Bruce MacDonald	77295f716e	prevent waiting on exited command (#752 ) * prevent waiting on exited command * close llama runner once	2023-10-11 12:32:13 -04:00
Bruce MacDonald	f2ba1311aa	improve vram safety with 5% vram memory buffer (#724 ) * check free memory not total * wait for subprocess to exit	2023-10-10 16:16:09 -04:00
Bruce MacDonald	5d22319a2c	rename server subprocess (#700 ) - this makes it easier to see that the subprocess is associated with ollama	2023-10-06 10:15:42 -04:00
Bruce MacDonald	9e2de1bd2c	increase streaming buffer size (#692 )	2023-10-04 14:09:00 -04:00
Michael Yang	c02c0cd483	starcoder	2023-10-02 19:56:51 -07:00
Bruce MacDonald	b1f7123301	clean up num_gpu calculation code (#673 )	2023-10-02 14:53:42 -04:00
Bruce MacDonald	1fbf3585d6	Relay default values to llama runner (#672 ) * include seed in params for llama.cpp server and remove empty filter for temp * relay default predict options to llama.cpp - reorganize options to match predict request for readability * omit empty stop --------- Co-authored-by: hallh <hallh@users.noreply.github.com>	2023-10-02 14:53:16 -04:00
Bruce MacDonald	9771b1ec51	windows runner fixes (#637 )	2023-09-29 11:47:55 -04:00
Michael Yang	f40b3de758	use int64 consistently	2023-09-28 11:07:24 -07:00
Bruce MacDonald	86279f4ae3	unbound max num gpu layers (#591 ) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-25 18:36:46 -04:00
Bruce MacDonald	4cba75efc5	remove tmp directories created by previous servers (#559 ) * remove tmp directories created by previous servers * clean up on server stop * Update routes.go * Update server/routes.go Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * create top-level temp ollama dir * check file exists before creating --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-21 20:38:49 +01:00
Bruce MacDonald	1255bc9b45	only package 11.8 runner	2023-09-20 20:00:41 +01:00
Bruce MacDonald	4e8be787c7	pack in cuda libs	2023-09-20 17:40:42 +01:00
Bruce MacDonald	66003e1d05	subprocess improvements (#524 ) * subprocess improvements - increase start-up timeout - when runner fails to start fail rather than timing out - try runners in order rather than choosing 1 runner - embed metal runner in metal dir rather than gpu - refactor logging and error messages * Update llama.go * Update llama.go * simplify by using glob	2023-09-18 15:16:32 -04:00
Bruce MacDonald	2540c9181c	support for packaging in multiple cuda runners (#509 ) * enable packaging multiple cuda versions * use nvcc cuda version if available --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-14 15:08:13 -04:00
Michael Yang	7dee25a07f	fix falcon decode get model and file type from bin file	2023-09-12 12:34:53 -07:00
Bruce MacDonald	f221637053	first pass at linux gpu support (#454 ) * linux gpu support * handle multiple gpus * add cuda docker image (#488) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-12 11:04:35 -04:00
Bruce MacDonald	09dd2aeff9	GGUF support (#441 )	2023-09-07 13:55:37 -04:00
Bruce MacDonald	42998d797d	subprocess llama.cpp server (#401 ) * remove c code * pack llama.cpp * use request context for llama_cpp * let llama_cpp decide the number of threads to use * stop llama runner when app stops * remove sample count and duration metrics * use go generate to get libraries * tmp dir for running llm	2023-08-30 16:35:03 -04:00
Quinn Slack	f4432e1dba	treat stop as stop sequences, not exact tokens (#442 ) The `stop` option to the generate API is a list of sequences that should cause generation to stop. Although these are commonly called "stop tokens", they do not necessarily correspond to LLM tokens (per the LLM's tokenizer). For example, if the caller sends a generate request with `"stop":["\n"]`, then generation should stop on any token containing `\n` (and trim `\n` from the output), not just if the token exactly matches `\n`. If `stop` were interpreted strictly as LLM tokens, then it would require callers of the generate API to know the LLM's tokenizer and enumerate many tokens in the `stop` list. Fixes https://github.com/jmorganca/ollama/issues/295.	2023-08-30 11:53:42 -04:00
Michael Yang	5ca05c2e88	fix ModelType()	2023-08-18 11:23:38 -07:00
Michael Yang	a894cc792d	model and file type as strings	2023-08-17 12:08:04 -07:00
Bruce MacDonald	4b2d366c37	Update llama.go	2023-08-14 12:55:50 -03:00
Bruce MacDonald	56fd4e4ef2	log embedding eval timing	2023-08-14 12:51:31 -03:00
Jeffrey Morgan	22885aeaee	update `llama.cpp` to `f64d44a`	2023-08-12 22:47:15 -04:00
Michael Yang	6de5d032e1	implement loading ggml lora adapters through the modelfile	2023-08-10 09:23:39 -07:00
Michael Yang	fccf8d179f	partial decode ggml bin for more info	2023-08-10 09:23:10 -07:00

1 2

90 commits