ollama

Author	SHA1	Message	Date
Bruce MacDonald	4cba75efc5	remove tmp directories created by previous servers (#559 ) * remove tmp directories created by previous servers * clean up on server stop * Update routes.go * Update server/routes.go Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * create top-level temp ollama dir * check file exists before creating --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-21 20:38:49 +01:00
Bruce MacDonald	1255bc9b45	only package 11.8 runner	2023-09-20 20:00:41 +01:00
Bruce MacDonald	4e8be787c7	pack in cuda libs	2023-09-20 17:40:42 +01:00
Bruce MacDonald	66003e1d05	subprocess improvements (#524 ) * subprocess improvements - increase start-up timeout - when runner fails to start fail rather than timing out - try runners in order rather than choosing 1 runner - embed metal runner in metal dir rather than gpu - refactor logging and error messages * Update llama.go * Update llama.go * simplify by using glob	2023-09-18 15:16:32 -04:00
Bruce MacDonald	2540c9181c	support for packaging in multiple cuda runners (#509 ) * enable packaging multiple cuda versions * use nvcc cuda version if available --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-14 15:08:13 -04:00
Michael Yang	7dee25a07f	fix falcon decode get model and file type from bin file	2023-09-12 12:34:53 -07:00
Bruce MacDonald	f221637053	first pass at linux gpu support (#454 ) * linux gpu support * handle multiple gpus * add cuda docker image (#488) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-12 11:04:35 -04:00
Bruce MacDonald	09dd2aeff9	GGUF support (#441 )	2023-09-07 13:55:37 -04:00
Bruce MacDonald	42998d797d	subprocess llama.cpp server (#401 ) * remove c code * pack llama.cpp * use request context for llama_cpp * let llama_cpp decide the number of threads to use * stop llama runner when app stops * remove sample count and duration metrics * use go generate to get libraries * tmp dir for running llm	2023-08-30 16:35:03 -04:00
Quinn Slack	f4432e1dba	treat stop as stop sequences, not exact tokens (#442 ) The `stop` option to the generate API is a list of sequences that should cause generation to stop. Although these are commonly called "stop tokens", they do not necessarily correspond to LLM tokens (per the LLM's tokenizer). For example, if the caller sends a generate request with `"stop":["\n"]`, then generation should stop on any token containing `\n` (and trim `\n` from the output), not just if the token exactly matches `\n`. If `stop` were interpreted strictly as LLM tokens, then it would require callers of the generate API to know the LLM's tokenizer and enumerate many tokens in the `stop` list. Fixes https://github.com/jmorganca/ollama/issues/295.	2023-08-30 11:53:42 -04:00
Michael Yang	5ca05c2e88	fix ModelType()	2023-08-18 11:23:38 -07:00
Michael Yang	a894cc792d	model and file type as strings	2023-08-17 12:08:04 -07:00
Bruce MacDonald	4b2d366c37	Update llama.go	2023-08-14 12:55:50 -03:00
Bruce MacDonald	56fd4e4ef2	log embedding eval timing	2023-08-14 12:51:31 -03:00
Jeffrey Morgan	22885aeaee	update `llama.cpp` to `f64d44a`	2023-08-12 22:47:15 -04:00
Michael Yang	6de5d032e1	implement loading ggml lora adapters through the modelfile	2023-08-10 09:23:39 -07:00
Michael Yang	fccf8d179f	partial decode ggml bin for more info	2023-08-10 09:23:10 -07:00

17 commits