Previously, two costly patterns made loading GGUF files and their
metadata and tensor information very slow:
* Too many allocations when decoding strings
* Hitting disk for every key and value read, resulting in an excessive
number of syscalls and disk I/O (see the buffered-read sketch below)
The show API is now down to 33ms from 800ms+ for llama3 on an M3 MacBook Pro.
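A minimal sketch of the buffered-read idea, with a hypothetical readString helper rather than ollama's actual decoder: wrapping the source in a bufio.Reader batches thousands of tiny key/value reads into a handful of syscalls, and each string payload is read into a single buffer instead of being assembled from many small allocations.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readString decodes a GGUF-style length-prefixed string (uint64 length
// followed by raw bytes), reading the payload into one buffer.
func readString(r io.Reader, order binary.ByteOrder) (string, error) {
	var n uint64
	if err := binary.Read(r, order, &n); err != nil {
		return "", err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	// Synthetic length-prefixed string: uint64 length 5, then "llama".
	var data bytes.Buffer
	binary.Write(&data, binary.LittleEndian, uint64(5))
	data.WriteString("llama")

	// In the real decoder this would wrap an *os.File; the large buffer is
	// what keeps per-key reads from each hitting the disk.
	br := bufio.NewReaderSize(&data, 1<<20)
	s, err := readString(br, binary.LittleEndian)
	fmt.Println(s, err) // llama <nil>
}
```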
This commit also makes it possible to skip collecting large arrays of
values when decoding GGUFs. When such keys are encountered, their
values are set to null and encoded as such in JSON.
Also, this fixes a broken test that was not encoding valid GGUF.
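A hedged sketch of that behavior; the names (decodeArray, maxArraySize) are invented for illustration, not ollama's API. The point is that an oversized array is replaced with nil, which encoding/json renders as null:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const maxArraySize = 1024 // assumed cutoff; the real limit may differ

// decodeArray stands in for the GGUF array decoder: when an array exceeds
// the cutoff, its elements are discarded and nil is stored instead.
func decodeArray(length int, elems []int32) any {
	if length > maxArraySize {
		return nil // marshaled as JSON null below
	}
	return elems
}

func main() {
	kv := map[string]any{
		"tokenizer.ggml.tokens": decodeArray(200000, nil), // too large: dropped
		"general.file_type":     int32(2),
	}
	out, _ := json.Marshal(kv)
	fmt.Println(string(out))
	// {"general.file_type":2,"tokenizer.ggml.tokens":null}
}
```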
Still not complete; the prediction needs refinement to understand each
discrete GPU's available space so we can determine how many layers fit in
each one. Since we can't split a single layer across multiple GPUs, we
can't treat their free space as one logical block.
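A simplified sketch of why the free space can't be pooled, with all sizes and names invented for illustration (not ollama's actual estimator): layers are fitted against each GPU's free memory individually, after reserving scratch space.

```go
package main

import "fmt"

type gpu struct {
	name string
	free uint64 // free VRAM in bytes
}

// layersThatFit counts whole layers across GPUs, reserving per-GPU scratch
// space first. Because a layer can't span devices, each GPU is fitted
// independently and the leftover fractions are wasted.
func layersThatFit(gpus []gpu, layerSize, scratch uint64) int {
	total := 0
	for _, g := range gpus {
		if g.free <= scratch {
			continue
		}
		total += int((g.free - scratch) / layerSize)
	}
	return total
}

func main() {
	gpus := []gpu{
		{"gpu0", 6 << 30}, // 6 GiB free
		{"gpu1", 4 << 30}, // 4 GiB free
	}
	// 900 MiB layers, 1 GiB scratch per GPU: 5 + 3 = 8 layers fit.
	// Pooling the 8 GiB of post-scratch free space would wrongly predict 9.
	fmt.Println(layersThatFit(gpus, 900<<20, 1<<30))
}
```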
* select layers based on estimated model memory usage
* always account for scratch VRAM
* don't load +1 layers
* better estimation for graph allocation
* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
* add overhead for cuda memory
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* fix build error on linux
* address comments
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in case the automatic pull fails
Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
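For illustration, a sketch of how such a format check could work; the helper is hypothetical, though the magic values are the well-known GGML-family and GGUF constants. A legacy magic is what would trigger the automatic pull of the GGUF variant:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Magic values as read from a file's first four bytes, little-endian.
const (
	magicGGML uint32 = 0x67676d6c
	magicGGMF uint32 = 0x67676d66
	magicGGJT uint32 = 0x67676a74
	magicGGUF uint32 = 0x46554747
)

// detectFormat stands in for the check deciding whether a model file can be
// loaded directly (gguf) or its gguf variant should be pulled (legacy ggml).
func detectFormat(r io.Reader) (string, error) {
	var magic uint32
	if err := binary.Read(r, binary.LittleEndian, &magic); err != nil {
		return "", err
	}
	switch magic {
	case magicGGUF:
		return "gguf", nil
	case magicGGML, magicGGMF, magicGGJT:
		return "ggml", nil // legacy: pull the gguf model automatically
	}
	return "", fmt.Errorf("unknown model format %#x", magic)
}

func main() {
	// First four bytes of a legacy ggjt file ("tjgg" on disk).
	format, err := detectFormat(bytes.NewReader([]byte{0x74, 0x6a, 0x67, 0x67}))
	if err != nil {
		fmt.Println("please update your model to the gguf format:", err)
		return
	}
	fmt.Println("detected:", format) // detected: ggml
}
```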
ggufv3 adds support for big endianness, mainly for the s390x architecture.
While that isn't currently supported by ollama, the change is simple.
Loosen the version check to be more forward compatible: unless otherwise
specified, gguf versions other than v1 will be decoded as v2.
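A small sketch of that decode path, with invented helper names: the byte order is derived from the magic, and any version other than 1 falls through to the v2 layout, so a v3 file (which mainly adds big-endian support) still loads.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const (
	fileMagicGGUFLE uint32 = 0x46554747 // "GGUF" bytes, little-endian file
	fileMagicGGUFBE uint32 = 0x47475546 // "GGUF" bytes, big-endian file (s390x)
)

// decodeHeader reads the magic and version, returning the byte order to use
// for the rest of the file and the layout version to decode with.
func decodeHeader(r io.Reader) (binary.ByteOrder, uint32, error) {
	var magic uint32
	if err := binary.Read(r, binary.LittleEndian, &magic); err != nil {
		return nil, 0, err
	}

	var order binary.ByteOrder
	switch magic {
	case fileMagicGGUFLE:
		order = binary.LittleEndian
	case fileMagicGGUFBE:
		order = binary.BigEndian
	default:
		return nil, 0, fmt.Errorf("invalid magic %#x", magic)
	}

	var version uint32
	if err := binary.Read(r, order, &version); err != nil {
		return nil, 0, err
	}
	if version == 1 {
		return order, 1, nil // v1 keeps its own layout
	}
	return order, 2, nil // v2, v3, and later all decode with the v2 layout
}

func main() {
	// Header of a little-endian GGUF v3 file: magic "GGUF", then version 3.
	hdr := []byte{'G', 'G', 'U', 'F', 0x03, 0x00, 0x00, 0x00}
	order, v, err := decodeHeader(bytes.NewReader(hdr))
	fmt.Println(order, v, err) // LittleEndian 2 <nil>
}
```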