Commit graph

21 commits

Daniel Hiltgen
b05c9e83d9
Introduce GPU Overhead env var (#5922)
Provide a mechanism for users to set aside an amount of VRAM on each GPU
to make room for other applications they want to start after Ollama, or to work around
memory prediction bugs.
2024-09-05 13:46:35 -07:00
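
As a rough illustration of the commit above, the sketch below subtracts a user-supplied per-GPU reservation from the VRAM reported as free before any layer-fit estimation. The variable name OLLAMA_GPU_OVERHEAD and the helper are assumptions for illustration, not the actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// gpuOverheadBytes reads a hypothetical OLLAMA_GPU_OVERHEAD variable: the
// number of bytes of VRAM to keep free on each GPU for other applications.
// Unset or malformed values fall back to zero (no reservation).
func gpuOverheadBytes() uint64 {
	n, err := strconv.ParseUint(os.Getenv("OLLAMA_GPU_OVERHEAD"), 10, 64)
	if err != nil {
		return 0
	}
	return n
}

func main() {
	freeVRAM := uint64(8 << 30) // example: a GPU reporting 8 GiB free
	overhead := gpuOverheadBytes()

	// Only the VRAM left after the reservation is offered to layer offload.
	usable := uint64(0)
	if freeVRAM > overhead {
		usable = freeVRAM - overhead
	}
	fmt.Printf("usable VRAM for offload: %d bytes\n", usable)
}
```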
Michael Yang
8e0641a9bf handle asymmetric embedding KVs 2024-06-20 09:57:27 -07:00
Daniel Hiltgen
359b15a597 Handle models with divergent layer sizes
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
2024-06-18 11:05:34 -07:00
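
A toy example of the distinction the commit above describes: estimating memory as layer count times the size of one layer versus summing each layer's actual size. The sizes below are made up for illustration.

```go
package main

import "fmt"

func main() {
	// Hypothetical per-layer sizes in bytes; models such as deepseek-coder-v2
	// have layers of noticeably different sizes.
	layers := []uint64{512 << 20, 512 << 20, 896 << 20, 896 << 20}

	// Uniform assumption: every layer is as big as the first one.
	uniform := uint64(len(layers)) * layers[0]

	// Per-layer accounting: sum each layer's actual size.
	var exact uint64
	for _, sz := range layers {
		exact += sz
	}

	fmt.Printf("uniform-size estimate: %d bytes\n", uniform)
	fmt.Printf("per-layer estimate:    %d bytes\n", exact)
}
```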
Daniel Hiltgen
7784ca33ce Tighten up memory prediction logging
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterated to find a suitable configuration, which could be
confusing since only the last log before the server starts is actually valid.
This now logs once, just before starting the server, on the final configuration.
It also reports which library is in use instead of always saying "offloading to gpu"
even when running on CPU.
2024-06-18 09:15:35 -07:00
Daniel Hiltgen
17df6520c8 Remove mmap related output calc logic 2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6f351bf586 review comments and coverage 2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6fd04ca922 Improve multi-gpu handling at the limit
Still not complete; needs some refinement to our prediction to understand each
discrete GPU's available space so we can see how many layers fit in each one.
Since we can't split one layer across multiple GPUs, we can't treat free space
as one logical block.
2024-06-14 14:51:40 -07:00
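
A sketch of why free space can't be pooled, per the commit above: since a single layer can't be split across GPUs, layers have to be placed whole into whichever GPU still has room. This is an illustrative sketch under that assumption, not the scheduler's actual algorithm.

```go
package main

import "fmt"

// placeLayers assigns whole layers to GPUs, never splitting a layer.
// It returns how many layers fit and the per-GPU counts.
func placeLayers(layerSize uint64, layerCount int, gpuFree []uint64) (int, []int) {
	perGPU := make([]int, len(gpuFree))
	placed := 0
	for placed < layerCount {
		fit := false
		for i := range gpuFree {
			if gpuFree[i] >= layerSize {
				gpuFree[i] -= layerSize
				perGPU[i]++
				placed++
				fit = true
				break
			}
		}
		if !fit {
			break // no single GPU has room for another whole layer
		}
	}
	return placed, perGPU
}

func main() {
	// Two GPUs with 1.5 GiB free each cannot hold a single 2 GiB layer,
	// even though 3 GiB of "total" free space exists across them.
	placed, perGPU := placeLayers(2<<30, 4, []uint64{1<<30 + 512<<20, 1<<30 + 512<<20})
	fmt.Printf("layers placed: %d, per GPU: %v\n", placed, perGPU)
}
```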
Michael Yang
6297f85606 gofmt, goimports 2024-06-04 13:20:24 -07:00
Michael Yang
e40145a39d lint 2024-06-04 11:13:30 -07:00
Patrick Devine
4cc3be3035
Move envconfig and consolidate env vars (#4608) 2024-05-24 14:57:15 -07:00
Michael Yang
1d359e737e typo 2024-05-13 14:18:34 -07:00
Michael Yang
50b9056e09 count memory up to NumGPU 2024-05-13 14:13:10 -07:00
Jeffrey Morgan
bb6fd02298
Don't clamp ctx size in PredictServerFit (#4317)
* don't clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
2024-05-10 10:17:12 -07:00
Daniel Hiltgen
bee2f4a3b0 Record GPU usage information
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Michael Yang
4736391bfb llm: add minimum based on layer size 2024-05-06 17:04:19 -07:00
Daniel Hiltgen
f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup, which should
help in troubleshooting user server logs.
2024-05-05 16:49:50 -07:00
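
A minimal sketch of the centralization described above: every env var is read in one load function and the resulting config is logged once at startup. The struct, defaults, and the subset of variables shown are illustrative, not the actual envconfig package.

```go
package main

import (
	"log/slog"
	"os"
)

// Config gathers the server settings in one struct so env vars are read in a
// single place rather than scattered across packages.
type Config struct {
	Host      string
	ModelsDir string
	Debug     bool
}

func load() Config {
	cfg := Config{Host: "127.0.0.1:11434"}
	if v := os.Getenv("OLLAMA_HOST"); v != "" {
		cfg.Host = v
	}
	if v := os.Getenv("OLLAMA_MODELS"); v != "" {
		cfg.ModelsDir = v
	}
	cfg.Debug = os.Getenv("OLLAMA_DEBUG") != ""
	return cfg
}

func main() {
	cfg := load()
	// Log the effective configuration exactly once at startup so user server
	// logs show what the process actually loaded.
	slog.Info("server config", "host", cfg.Host, "models", cfg.ModelsDir, "debug", cfg.Debug)
}
```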
Jeffrey Morgan
f0c454ab57
gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068) 2024-05-01 11:46:03 -04:00
Michael Yang
f81f308118 fix gemma, command-r layer weights 2024-04-26 15:00:55 -07:00
Michael Yang
7bb7cb8a60 only count output tensors 2024-04-25 15:24:08 -07:00
Daniel Hiltgen
5445aaa94e Add back memory escape valve
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround.  Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
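
To picture the "escape valve" above: it amounts to letting the user cap how much VRAM the predictions are allowed to assume. The sketch below clamps detected free memory to a hypothetical OLLAMA_MAX_VRAM-style limit; the name and shape are assumptions, not the restored code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// clampVRAM caps the detected free VRAM at a user-supplied limit read from
// a hypothetical OLLAMA_MAX_VRAM-style variable (bytes). Zero means no cap.
func clampVRAM(detected uint64) uint64 {
	limit, _ := strconv.ParseUint(os.Getenv("OLLAMA_MAX_VRAM"), 10, 64)
	if limit > 0 && detected > limit {
		return limit
	}
	return detected
}

func main() {
	fmt.Println(clampVRAM(24 << 30)) // with no limit set, prints the detected value
}
```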
Daniel Hiltgen
34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
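
To make the defaults above concrete, the sketch below reads the two env vars named in the commit, falls back to 1 for each, and uses a buffered channel as a semaphore for per-runner request slots. This is an illustrative sketch of the idea, not the scheduler's actual code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt returns the integer value of key, or def if it is unset or invalid.
func envInt(key string, def int) int {
	if n, err := strconv.Atoi(os.Getenv(key)); err == nil && n > 0 {
		return n
	}
	return def
}

func main() {
	// Defaults match the commit message: 1 parallel request per model and
	// 1 loaded model at a time unless the env vars say otherwise.
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1)

	// A buffered channel works as a simple semaphore gating how many
	// requests may run against one model runner at a time.
	slots := make(chan struct{}, numParallel)
	slots <- struct{}{} // acquire a slot before serving a request
	<-slots             // release it when the request completes

	fmt.Printf("max loaded models: %d, parallel requests per model: %d\n",
		maxLoaded, numParallel)
}
```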