Commit graph

26 commits

Author SHA1 Message Date
Jeffrey Morgan
152fc202f5
llm: update llama.cpp commit to 7c26775 (#4896)
* llm: update llama.cpp submodule to `7c26775`

* disable `LLAMA_BLAS` for now

* `-DLLAMA_OPENMP=off`
2024-06-17 15:56:16 -04:00
Jeffrey Morgan
ce0dc33cb8
llm: patch to fix qwen 2 temporarily on nvidia (#4897) 2024-06-06 23:14:33 -07:00
Jeffrey Morgan
22f5c12ced
Update llama.cpp submodule to 5921b8f0 (#4731)
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`

* add patch
2024-05-30 16:20:22 -07:00
Michael Yang
714adb8bd1
bump (#4597) 2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Jeffrey Morgan
583c1f472c
update llama.cpp submodule to 614d3b9 (#4414) 2024-05-16 13:53:09 -07:00
Jeffrey Morgan
1b0e6c9c0e
Fix llava models not working after first request (#4164)
* fix llava models not working after first request

* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen
85801317d1 Fix clip log import 2024-04-26 09:43:46 -07:00
jmorganca
ddf5c09a9b use matrix multiplcation kernels in more cases 2024-04-25 13:58:54 -07:00
Daniel Hiltgen
0035e31af8 Bump to b2581 2024-04-02 11:53:07 -07:00
Daniel Hiltgen
43799532c1 Bump llama.cpp to b2474
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Michael Yang
291c663865 fix: clip memory leak 2024-03-14 13:12:42 -07:00
Jeffrey Morgan
e72c567cfd
restore locale patch (#3091) 2024-03-12 22:08:13 -07:00
Bruce MacDonald
b80661e8c7
relay load model errors to the client (#3065) 2024-03-11 16:48:27 -04:00
Jeffrey Morgan
369eda65f5
update llama.cpp submodule to ceca1ae (#3064) 2024-03-11 12:57:48 -07:00
Jeffrey Morgan
41b00b9856 fix 03-locale.diff 2024-03-10 16:21:05 -07:00
Jeffrey Morgan
908005d90b
patch: use default locale in wpm tokenizer (#3034) 2024-03-09 21:12:12 -08:00
Jeffrey Morgan
1ffb1e2874
update llama.cpp submodule to 77d1ac7 (#3030) 2024-03-09 15:55:34 -08:00
Jeffrey Morgan
0e4669b04f
update llama.cpp submodule to 6cdabe6 (#2999) 2024-03-08 00:26:20 -08:00
Jeffrey Morgan
21347e1ed6
update llama.cpp submodule to c29af7e (#2868) 2024-03-01 15:26:04 -08:00
Jeffrey Morgan
4613a080e7
update llama.cpp submodule to 66c1968f7 (#2618) 2024-02-20 17:42:31 -05:00
Daniel Hiltgen
fc39a6cd7a Fix cuda leaks
This should resolve the problem where we don't fully unload from the GPU
when we go idle.
2024-02-18 18:37:20 -08:00
Jeffrey Morgan
26b13fc33c
patch: always add token to cache_tokens (#2459) 2024-02-12 08:10:16 -08:00
Daniel Hiltgen
de76b95dd4 Bump llama.cpp to b2081 2024-02-06 12:06:43 -08:00
Daniel Hiltgen
72b12c3be7 Bump llama.cpp to b1999
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
Jeffrey Morgan
a64570dcae
Fix clearing kv cache between requests with the same prompt (#2186)
* Fix clearing kv cache between requests with the same prompt

* fix powershell script
2024-01-25 13:46:20 -08:00