Commit graph

9 commits

Author SHA1 Message Date
Daniel Hiltgen
34b9db5afc Request and model concurrency
This change adds support for handling multiple concurrent requests and for
loading multiple models by spawning multiple runners. The defaults are
currently 1 concurrent request per model and only 1 loaded model at a time,
but these can be adjusted by setting OLLAMA_NUM_PARALLEL and
OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
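
The two environment variables named in the commit above are the tuning knobs. The snippet below is only a minimal sketch of how such overrides might be read with a default of 1, not the repository's actual scheduler code; the envInt helper name is made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt is a hypothetical helper: read a positive integer override from the
// environment, fall back to the given default otherwise.
func envInt(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	// Defaults described in the commit: 1 parallel request per model,
	// 1 loaded model at a time.
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1)
	fmt.Printf("parallel requests per model: %d, max loaded models: %d\n",
		numParallel, maxLoaded)
}
```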
Michael Yang
7e33a017c0 partial offloading 2024-04-10 11:37:20 -07:00
Michael Yang
91b3e4d282 update memory calculations
count each layer independently when deciding GPU offloading
2024-04-01 13:16:32 -07:00
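
As a rough illustration of counting each layer independently when deciding how much of a model fits on the GPU: the layersToOffload helper, the per-layer sizes, and the free-VRAM figure below are hypothetical, not the repository's memory-estimation code.

```go
package main

import "fmt"

// Walk the layers in order, accumulating each layer's own size, and stop at
// the first layer that no longer fits in the remaining VRAM.
func layersToOffload(layerBytes []uint64, freeVRAM uint64) int {
	var used uint64
	for i, size := range layerBytes {
		if used+size > freeVRAM {
			return i // this layer and the rest stay on the CPU
		}
		used += size
	}
	return len(layerBytes) // everything fits: full offload
}

func main() {
	// Hypothetical per-layer sizes in bytes; layers need not be equal.
	layers := []uint64{512 << 20, 512 << 20, 768 << 20, 768 << 20}
	fmt.Println("offload", layersToOffload(layers, 1800<<20), "of", len(layers), "layers")
}
```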
Michael Yang
424d53ac70 progress: fix bar rate 2023-11-28 11:44:56 -08:00
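
The commit subject above does not say how the bar rate was fixed; one common approach, sketched here purely as an assumption, is to derive the rate from total bytes over total elapsed time rather than from the size of the most recent chunk.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative progress tracker: rate is computed over the whole transfer.
type progress struct {
	start       time.Time
	transferred int64
}

func (p *progress) add(n int64) { p.transferred += n }

func (p *progress) rate() float64 {
	elapsed := time.Since(p.start).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(p.transferred) / elapsed // bytes per second
}

func main() {
	p := &progress{start: time.Now()}
	p.add(5 << 20) // pretend 5 MiB arrived
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("%.0f B/s\n", p.rate())
}
```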
Jeffrey Morgan
93a108214c only show decimal points for smaller file size numbers 2023-11-20 10:58:19 -05:00
Michael Yang
9f04e5a8ea format bytes 2023-11-17 10:06:19 -08:00
Michael Yang
01ea6002c4 replace go-humanize with format.HumanBytes 2023-11-14 14:57:41 -08:00
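
The byte-formatting commits above (format bytes, format.HumanBytes, decimals only for smaller file sizes) suggest a formatter along these lines. This humanBytes sketch and its thresholds are illustrative assumptions, not the actual format.HumanBytes implementation.

```go
package main

import "fmt"

// Scale to the largest fitting unit; only smaller scaled values keep a
// decimal point, larger ones are shown as whole numbers.
func humanBytes(b int64) string {
	const (
		kb = 1000.0
		mb = kb * 1000
		gb = mb * 1000
	)
	var value float64
	var unit string
	switch {
	case float64(b) >= gb:
		value, unit = float64(b)/gb, "GB"
	case float64(b) >= mb:
		value, unit = float64(b)/mb, "MB"
	case float64(b) >= kb:
		value, unit = float64(b)/kb, "KB"
	default:
		return fmt.Sprintf("%d B", b)
	}
	if value < 10 {
		return fmt.Sprintf("%.1f %s", value, unit)
	}
	return fmt.Sprintf("%.0f %s", value, unit)
}

func main() {
	for _, n := range []int64{42, 4_500, 13_000_000, 7_000_000_000} {
		fmt.Println(humanBytes(n))
	}
}
```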
Michael Yang
92189a5855 fix memory check 2023-10-13 14:47:29 -07:00
Michael Yang
b599946b74 add format bytes 2023-10-11 14:08:23 -07:00