Daniel Hiltgen
c4209d6d21
Report better warning on client closed abort of load
...
If the client closes the connection before we finish loading the model,
we abort, so let's make the log message clearer about why, to help users
understand this failure mode
2024-05-25 09:23:28 -07:00
Michael Yang
714adb8bd1
bump (#4597)
2024-05-23 14:16:26 -07:00
Daniel Hiltgen
95b1133d0c
Merge pull request #4547 from dhiltgen/load_progress
...
Wire up load progress
2024-05-23 14:06:02 -07:00
Daniel Hiltgen
b37b496a12
Wire up load progress
...
This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Bruce MacDonald
d6f692ad1a
Add support for IQ1_S, IQ3_S, IQ2_S, IQ4_XS, IQ4_NL (#4322)
...
Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>
2024-05-23 13:21:49 -07:00
Jeffrey Morgan
38255d2af1
Use flash attention flag for now (#4580)
...
* put flash attention behind flag for now
* add test
* remove print
* up timeout for scheduler tests
2024-05-22 21:52:09 -07:00
Michael Yang
171eb040fc
simplify safetensors reading
2024-05-21 11:28:22 -07:00
Michael Yang
bbbd9f20f3
cleanup
2024-05-20 16:13:57 -07:00
Michael Yang
547132e820
bpe pretokenizer
2024-05-20 16:13:57 -07:00
Patrick Devine
c8cf0d94ed
llama3 conversion
2024-05-20 16:13:57 -07:00
jmorganca
5cab13739e
set llama.cpp submodule commit to 614d3b9
2024-05-20 15:28:17 -07:00
Josh Yan
8aadad9c72
updated updateURL
2024-05-20 15:24:32 -07:00
Sam
e15307fdf4
feat: add support for flash_attn (#4120)
...
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Jeffrey Morgan
583c1f472c
update llama.cpp submodule to `614d3b9` (#4414)
2024-05-16 13:53:09 -07:00
Daniel Hiltgen
c48c1d7c46
Port cuda/rocm skip build vars to linux
...
Windows already implements these, carry over to linux.
2024-05-15 15:56:43 -07:00
Patrick Devine
d1692fd3e0
fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461)
2024-05-15 15:43:16 -07:00
Daniel Hiltgen
853ae490e1
Sanitize the env var debug log
...
Only dump env vars we care about in the logs
2024-05-15 14:42:57 -07:00
Michael Yang
0e331c7168
Merge pull request #4328 from ollama/mxyng/mem
...
count memory up to NumGPU if set by user
2024-05-14 13:47:44 -07:00
Patrick Devine
6845988807
Ollama `ps` command for showing currently loaded models (#4327)
2024-05-13 17:17:36 -07:00
Michael Yang
1d359e737e
typo
2024-05-13 14:18:34 -07:00
Michael Yang
50b9056e09
count memory up to NumGPU
2024-05-13 14:13:10 -07:00
jmorganca
92ca2cca95
Revert "only forward some env vars"
...
This reverts commit ce3b212d12.
2024-05-10 22:53:21 -07:00
Daniel Hiltgen
c4014e73a2
Fall back to CPU runner with zero layers
2024-05-10 15:09:48 -07:00
Michael Yang
1eb382da5a
add phi2 mem
2024-05-10 12:13:28 -07:00
Jeffrey Morgan
bb6fd02298
Don't clamp ctx size in `PredictServerFit` (#4317)
...
* don't clamp ctx size in `PredictServerFit`
* minimum 4 context
* remove context warning
2024-05-10 10:17:12 -07:00
Michael Yang
cf442cd57e
fix typo
2024-05-09 16:23:37 -07:00
Michael Yang
ce3b212d12
only forward some env vars
2024-05-09 15:16:09 -07:00
Michael Yang
58876091f7
log clean up
2024-05-09 14:55:36 -07:00
Daniel Hiltgen
d0425f26cf
Merge pull request #4294 from dhiltgen/harden_subprocess_reaping
...
Harden subprocess reaping
2024-05-09 14:02:16 -07:00
Bruce MacDonald
cfa84b8470
add done_reason to the api (#4235)
2024-05-09 13:30:14 -07:00
Daniel Hiltgen
84ac7ce139
Refine subprocess reaping
2024-05-09 11:21:31 -07:00
Daniel Hiltgen
920a4b0794
Merge remote-tracking branch 'upstream/main' into pr3702
2024-05-08 16:44:35 -07:00
Daniel Hiltgen
ee49844d09
Merge pull request #4153 from dhiltgen/gpu_verbose_response
...
Add GPU usage
2024-05-08 16:39:11 -07:00
Daniel Hiltgen
8a516ac862
Merge pull request #4241 from dhiltgen/fix_tmp_override
...
Detect noexec and report a better error
2024-05-08 15:34:22 -07:00
Daniel Hiltgen
bee2f4a3b0
Record GPU usage information
...
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Michael Yang
eeb695261f
skip if same quantization
2024-05-07 17:44:19 -07:00
Daniel Hiltgen
72700279e2
Detect noexec and report a better error
...
This will bubble up a much more informative error message if noexec
is preventing us from running the subprocess
2024-05-07 16:46:15 -07:00
Michael Yang
1e0a669f75
Merge pull request #3682 from ollama/mxyng/quantize-all-the-things
...
quantize any fp16/fp32 model
2024-05-07 15:20:49 -07:00
Michael Yang
4736391bfb
llm: add minimum based on layer size
2024-05-06 17:04:19 -07:00
Michael Yang
01811c176a
comments
2024-05-06 15:24:01 -07:00
Michael Yang
9685c34509
quantize any fp16/fp32 model
...
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
2024-05-06 15:24:01 -07:00
Daniel Hiltgen
380378cc80
Use our libraries first
...
Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly
2024-05-06 14:23:29 -07:00
Jeffrey Morgan
ed740a2504
Fix `no slots available` error with concurrent requests (#4160)
2024-05-06 14:22:53 -07:00
Jeffrey Morgan
1b0e6c9c0e
Fix llava models not working after first request (#4164)
...
* fix llava models not working after first request
* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen
f56aa20014
Centralize server config handling
...
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
2024-05-05 16:49:50 -07:00
Michael Yang
44869c59d6
omit prompt and generate settings from final response
2024-05-03 17:00:02 -07:00
Mark Ward
321d57e1a0
Removing goroutine calling .wait from load.
2024-05-01 18:51:10 +00:00
Mark Ward
ba26c7aa00
it will always return an error due to Kill() discarding Wait() errors
2024-05-01 18:51:10 +00:00
Mark Ward
63c763685f
log while waiting for the process to stop, to help debug when other tasks execute during this wait.
...
expire timer: clear the timer reference because it will not be reused.
close will clean up expireTimer if the calling code has not already done so.
2024-05-01 18:51:10 +00:00
Mark Ward
948114e3e3
fix sched to wait for the runner to terminate so that the following vram check is more accurate
2024-05-01 18:51:10 +00:00