Commit graph

25 commits

Author SHA1 Message Date
Michael Yang
d07cf41a97 refactor kv estimation 2024-11-01 16:23:55 -07:00
Patrick Devine
c7cb0f0602
image processing for llama3.2 (#6963)
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>
2024-10-18 16:12:35 -07:00
Daniel Hiltgen
05cd82ef94
Rename gpu package discover (#7143)
Cleaning up go package naming
2024-10-16 17:45:00 -07:00
Daniel Hiltgen
56318fb365
Improve logging on GPU too small (#6666)
When we determine a GPU is too small for any layers, it's not always clear why.
This will help troubleshoot those scenarios.
2024-09-06 08:29:36 -07:00
Daniel Hiltgen
b05c9e83d9
Introduce GPU Overhead env var (#5922)
Provide a mechanism for users to set aside an amount of VRAM on each GPU
to make room for other applications they want to start after Ollama, or workaround
memory prediction bugs
2024-09-05 13:46:35 -07:00
Michael Yang
8e0641a9bf handle asymmetric embedding KVs 2024-06-20 09:57:27 -07:00
Daniel Hiltgen
359b15a597 Handle models with divergent layer sizes
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
2024-06-18 11:05:34 -07:00
Daniel Hiltgen
7784ca33ce Tighten up memory prediction logging
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
2024-06-18 09:15:35 -07:00
Daniel Hiltgen
17df6520c8 Remove mmap related output calc logic 2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6f351bf586 review comments and coverage 2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6fd04ca922 Improve multi-gpu handling at the limit
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
2024-06-14 14:51:40 -07:00
Michael Yang
6297f85606 gofmt, goimports 2024-06-04 13:20:24 -07:00
Michael Yang
e40145a39d lint 2024-06-04 11:13:30 -07:00
Patrick Devine
4cc3be3035
Move envconfig and consolidate env vars (#4608) 2024-05-24 14:57:15 -07:00
Michael Yang
1d359e737e typo 2024-05-13 14:18:34 -07:00
Michael Yang
50b9056e09 count memory up to NumGPU 2024-05-13 14:13:10 -07:00
Jeffrey Morgan
bb6fd02298
Don't clamp ctx size in PredictServerFit (#4317)
* dont clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
2024-05-10 10:17:12 -07:00
Daniel Hiltgen
bee2f4a3b0 Record GPU usage information
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Michael Yang
4736391bfb llm: add minimum based on layer size 2024-05-06 17:04:19 -07:00
Daniel Hiltgen
f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
2024-05-05 16:49:50 -07:00
Jeffrey Morgan
f0c454ab57
gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068) 2024-05-01 11:46:03 -04:00
Michael Yang
f81f308118 fix gemma, command-r layer weights 2024-04-26 15:00:55 -07:00
Michael Yang
7bb7cb8a60 only count output tensors 2024-04-25 15:24:08 -07:00
Daniel Hiltgen
5445aaa94e Add back memory escape valve
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround.  Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
Daniel Hiltgen
34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00