Commit graph

5 commits

Jeffrey Morgan
f0c454ab57 gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068) 2024-05-01 11:46:03 -04:00
Michael Yang
f81f308118 fix gemma, command-r layer weights 2024-04-26 15:00:55 -07:00
Michael Yang
7bb7cb8a60 only count output tensors 2024-04-25 15:24:08 -07:00
Daniel Hiltgen
5445aaa94e Add back memory escape valve
If we get our predictions wrong, this can be used to set a lower memory limit as a workaround. Recent multi-gpu refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
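
The escape valve described in the commit above is an environment-driven override of the predicted GPU memory limit. Below is a minimal sketch of how such an override could be read, assuming a hypothetical OLLAMA_MAX_VRAM variable expressed in bytes; the variable name, unit, and helper are illustrative and not taken from the commit itself.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxVRAM returns a user-supplied VRAM ceiling in bytes, or 0 when the
// override is unset and the normal memory prediction should be used.
func maxVRAM() uint64 {
	v := os.Getenv("OLLAMA_MAX_VRAM") // hypothetical variable name, for illustration only
	if v == "" {
		return 0
	}
	n, err := strconv.ParseUint(v, 10, 64)
	if err != nil {
		fmt.Fprintf(os.Stderr, "ignoring invalid OLLAMA_MAX_VRAM=%q: %v\n", v, err)
		return 0
	}
	return n
}

func main() {
	if limit := maxVRAM(); limit > 0 {
		fmt.Println("capping GPU memory estimate at", limit, "bytes")
	} else {
		fmt.Println("no override set; using predicted memory limit")
	}
}
```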
Daniel Hiltgen
34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
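
The concurrency limits named in the commit above are plain environment variables, with both defaults at 1 per the message. Below is a minimal sketch of how a server might read them at startup; the parsing helper is illustrative, not the project's actual code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads a positive integer from the environment, falling back to def
// when the variable is unset or malformed.
func envInt(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)    // concurrent requests per loaded model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1) // models kept loaded at once
	fmt.Println("parallel requests per model:", numParallel)
	fmt.Println("max loaded models:", maxLoaded)
}
```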