ollama

Author	SHA1	Message	Date
Michael Yang	1eb382da5a	add phi2 mem	2024-05-10 12:13:28 -07:00
Michael Yang	eeb695261f	skip if same quantization	2024-05-07 17:44:19 -07:00
Michael Yang	01811c176a	comments	2024-05-06 15:24:01 -07:00
Michael Yang	9685c34509	quantize any fp16/fp32 model - FROM /path/to/{safetensors,pytorch} - FROM /path/to/fp{16,32}.bin - FROM model:fp{16,32}	2024-05-06 15:24:01 -07:00
Michael Yang	435cc866a3	fix: mixtral graph	2024-04-22 17:19:44 -07:00
Michael Yang	3cf483fe48	add stablelm graph calculation	2024-04-17 13:57:19 -07:00
Michael Yang	a8b9b930b4	account for all non-repeating layers	2024-04-17 11:21:21 -07:00
Michael Yang	3397eff0cd	mixtral mem	2024-04-11 11:10:41 -07:00
Michael Yang	7e33a017c0	partial offloading	2024-04-10 11:37:20 -07:00
Michael Yang	8b2c10061c	refactor tensor query	2024-04-10 11:37:20 -07:00
Michael Yang	01f77ae25d	add command-r graph estimate	2024-04-04 14:07:24 -07:00
Michael Yang	12e923e158	update graph size estimate	2024-04-03 13:34:12 -07:00
Michael Yang	90f071c658	default head_kv to 1	2024-04-02 16:37:59 -07:00
Michael Yang	91b3e4d282	update memory calcualtions count each layer independently when deciding gpu offloading	2024-04-01 13:16:32 -07:00
Michael Yang	d338d70492	refactor model parsing	2024-04-01 13:16:15 -07:00
Patrick Devine	5a5efee46b	Add gemma safetensors conversion (#3250 ) Co-authored-by: Michael Yang <mxyng@pm.me>	2024-03-28 18:54:01 -07:00
Michael Yang	0085297928	refactor readseeker	2024-03-12 12:54:18 -07:00
Michael Yang	76bdebbadf	decode ggla	2024-03-08 15:46:25 -08:00
Patrick Devine	2c017ca441	Convert Safetensors to an Ollama model (#2824 )	2024-03-06 21:01:51 -08:00
Michael Yang	949d7b1c48	add gguf file types (#2532 )	2024-02-20 19:06:29 -05:00
Michael Yang	eaed6f8c45	add max context length check	2024-01-12 14:54:07 -08:00
Michael Yang	2bb2bdd5d4	fix lint	2024-01-09 09:36:58 -08:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850 ) * select layers based on estimated model memory usage * always account for scratch vram * dont load +1 layers * better estmation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Bruce MacDonald	811b1f03c8	deprecate ggml - remove ggml runner - automatically pull gguf models when ggml detected - tell users to update to gguf in the case automatic pull fails Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>	2023-12-19 09:05:46 -08:00
Jeffrey Morgan	d9a250e9b5	seek to end of file when decoding older model formats	2023-12-09 21:14:35 -05:00
Jeffrey Morgan	944519ed16	seek to eof for older model binaries	2023-12-09 20:48:57 -05:00
Michael Yang	72e7a49aa9	seek instead of copyn	2023-12-04 16:59:23 -08:00
Michael Yang	2cb0fa7d40	split from into one or more models	2023-12-04 16:59:23 -08:00
Michael Yang	b2816bca67	unnecessary ReadSeeker for DecodeGGML	2023-12-04 16:59:23 -08:00
Michael Yang	125d0a013a	ggufv3 ggufv3 adds support for big endianness, mainly for s390x architecture. while that's not currently supported for ollama, the change is simple. loosen version check to be more forward compatible. unless specified, gguf versions other v1 will be decoded into v2.	2023-10-23 09:35:49 -07:00
Michael Yang	c02c0cd483	starcoder	2023-10-02 19:56:51 -07:00
Bruce MacDonald	86279f4ae3	unbound max num gpu layers (#591 ) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-25 18:36:46 -04:00
Bruce MacDonald	4cba75efc5	remove tmp directories created by previous servers (#559 ) * remove tmp directories created by previous servers * clean up on server stop * Update routes.go * Update server/routes.go Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * create top-level temp ollama dir * check file exists before creating --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-21 20:38:49 +01:00
Bruce MacDonald	66003e1d05	subprocess improvements (#524 ) * subprocess improvements - increase start-up timeout - when runner fails to start fail rather than timing out - try runners in order rather than choosing 1 runner - embed metal runner in metal dir rather than gpu - refactor logging and error messages * Update llama.go * Update llama.go * simplify by using glob	2023-09-18 15:16:32 -04:00
Bruce MacDonald	2540c9181c	support for packaging in multiple cuda runners (#509 ) * enable packaging multiple cuda versions * use nvcc cuda version if available --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-14 15:08:13 -04:00
Michael Yang	7dee25a07f	fix falcon decode get model and file type from bin file	2023-09-12 12:34:53 -07:00
Bruce MacDonald	09dd2aeff9	GGUF support (#441 )	2023-09-07 13:55:37 -04:00
Michael Yang	b1cececb8e	add 34b model type	2023-08-24 10:35:44 -07:00
Michael Yang	a894cc792d	model and file type as strings	2023-08-17 12:08:04 -07:00
Michael Yang	6ed991c8e2	ggml: fix off by one error remove used Unknown FileType	2023-08-11 10:45:22 -07:00
Michael Yang	fccf8d179f	partial decode ggml bin for more info	2023-08-10 09:23:10 -07:00

41 commits