Michael Yang
504a410f02
llm: add solar pro (preview) ( #6846 )
2024-09-17 18:11:26 -07:00
Michael Yang
7bd7b02712
make patches git am-able
raw diffs can be applied using `git apply` but not with `git am`. git
patches, e.g. those generated with `git format-patch`, are both apply-able and am-able
2024-09-17 15:26:40 -07:00
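The distinction the commit message draws can be sketched in a throwaway repo (the file name, commit messages, and identity below are made up for illustration): a `git format-patch` patch carries mail-style `From:`/`Date:`/`Subject:` headers, which is what lets `git am` recreate the commit, while a raw diff has no such metadata.

```shell
set -e
# hypothetical temp repo just for the demonstration
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m base
echo hello > f.txt
git add f.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add f"

# format-patch output includes commit metadata as mail headers...
git format-patch -1 --stdout > am-able.patch
# ...a raw diff does not, so `git am` has nothing to build a commit from
git diff HEAD~1 HEAD > raw.diff

git reset -q --hard HEAD~1           # rewind so the change can be re-applied
git apply --check raw.diff           # the raw diff is still apply-able
git -c user.name=demo -c user.email=demo@example.com am -q am-able.patch
git log -1 --format=%s               # prints "add f": commit message restored
```

Running `git am raw.diff` at the same point would instead fail with "Patch format detection failed", since there are no headers to parse.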
Jeffrey Morgan
5e2653f9fe
llm: update llama.cpp commit to 8962422 ( #6618 )
2024-09-03 21:12:39 -04:00
Daniel Hiltgen
90ca84172c
Fix embeddings memory corruption ( #6467 )
* Fix embeddings memory corruption
The patch was leading to a buffer overrun corruption. Once removed, though, parallelism
in server.cpp led to hitting an assert due to slot/seq IDs being >= token count. To
work around this, only use slot 0 for embeddings.
* Fix embed integration test assumption
The token eval count has changed with recent llama.cpp bumps (0.3.5+)
2024-08-22 14:51:42 -07:00
Jeffrey Morgan
e04c7012c2
update llama.cpp submodule to 1e6f6554 ( #6208 )
2024-08-06 15:11:45 -04:00
Michael Yang
0f3271db88
patches: phi3 default sliding window attention
2024-07-31 14:58:34 -07:00
jmorganca
afa8d6e9d5
patch gemma support
2024-07-30 18:07:29 -07:00
Jeffrey Morgan
68ee42f995
update llama.cpp submodule to 6eeaeba1 ( #6039 )
2024-07-29 13:20:26 -07:00
Jeffrey Morgan
f2a96c7d77
llm: keep patch for llama 3 rope factors ( #5987 )
2024-07-26 15:20:52 -07:00
Jeffrey Morgan
bbf8f102ee
Revert "llm(llama): pass rope factors ( #5924 )" ( #5963 )
This reverts commit bb46bbcf5e.
2024-07-25 18:24:55 -04:00
Michael Yang
bb46bbcf5e
llm(llama): pass rope factors ( #5924 )
2024-07-24 16:05:59 -04:00
Jeffrey Morgan
f8fedbda20
Update llama.cpp submodule commit to d94c6e0c ( #5805 )
2024-07-22 12:42:00 -04:00
Jeffrey Morgan
5534f2cc6a
llm: consider head_dim in llama arch ( #5817 )
2024-07-20 21:48:12 -04:00
Jeffrey Morgan
1475eab95f
add patch for tekken ( #5807 )
2024-07-20 13:41:21 -04:00
Jeffrey Morgan
571dc61955
Update llama.cpp submodule to a8db2a9c ( #5530 )
2024-07-07 13:03:09 -04:00
Jeffrey Morgan
8f8e736b13
update llama.cpp submodule to d7fd29f ( #5475 )
2024-07-05 13:25:58 -04:00
Jeffrey Morgan
e9188e971a
Fix assert on small embedding inputs ( #5491 )
* Fix assert on small embedding inputs
* Update llm/patches/09-pooling.diff
2024-07-05 11:20:57 -04:00
Daniel Hiltgen
6298f49816
Fix clip model loading with unicode paths
On Windows, if the model dir contained unicode characters,
clip models would fail to load. This fixes the file name
handling in clip.cpp to support UTF-16 on Windows.
2024-07-03 12:46:36 -07:00
Jeffrey Morgan
4d311eb731
llm: architecture patch ( #5316 )
2024-06-26 21:38:12 -07:00
Jeffrey Morgan
152fc202f5
llm: update llama.cpp commit to 7c26775 ( #4896 )
* llm: update llama.cpp submodule to `7c26775`
* disable `LLAMA_BLAS` for now
* `-DLLAMA_OPENMP=off`
2024-06-17 15:56:16 -04:00
Jeffrey Morgan
ce0dc33cb8
llm: patch to fix qwen 2 temporarily on nvidia ( #4897 )
2024-06-06 23:14:33 -07:00
Jeffrey Morgan
22f5c12ced
Update llama.cpp submodule to 5921b8f0 ( #4731 )
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`
* add patch
2024-05-30 16:20:22 -07:00
Michael Yang
714adb8bd1
bump ( #4597 )
2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12
Wire up load progress
This doesn't expose a UX yet, but wires up the initial server portion
of progress reporting during load.
2024-05-23 13:36:48 -07:00
Jeffrey Morgan
583c1f472c
update llama.cpp submodule to 614d3b9 ( #4414 )
2024-05-16 13:53:09 -07:00
Jeffrey Morgan
1b0e6c9c0e
Fix llava models not working after first request ( #4164 )
* fix llava models not working after first request
* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen
85801317d1
Fix clip log import
2024-04-26 09:43:46 -07:00
jmorganca
ddf5c09a9b
use matrix multiplication kernels in more cases
2024-04-25 13:58:54 -07:00
Daniel Hiltgen
0035e31af8
Bump to b2581
2024-04-02 11:53:07 -07:00
Daniel Hiltgen
43799532c1
Bump llama.cpp to b2474
The release just before the ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Michael Yang
291c663865
fix: clip memory leak
2024-03-14 13:12:42 -07:00
Jeffrey Morgan
e72c567cfd
restore locale patch ( #3091 )
2024-03-12 22:08:13 -07:00
Bruce MacDonald
b80661e8c7
relay load model errors to the client ( #3065 )
2024-03-11 16:48:27 -04:00
Jeffrey Morgan
369eda65f5
update llama.cpp submodule to ceca1ae ( #3064 )
2024-03-11 12:57:48 -07:00
Jeffrey Morgan
41b00b9856
fix 03-locale.diff
2024-03-10 16:21:05 -07:00
Jeffrey Morgan
908005d90b
patch: use default locale in wpm tokenizer ( #3034 )
2024-03-09 21:12:12 -08:00
Jeffrey Morgan
1ffb1e2874
update llama.cpp submodule to 77d1ac7 ( #3030 )
2024-03-09 15:55:34 -08:00
Jeffrey Morgan
0e4669b04f
update llama.cpp submodule to 6cdabe6 ( #2999 )
2024-03-08 00:26:20 -08:00
Jeffrey Morgan
21347e1ed6
update llama.cpp submodule to c29af7e ( #2868 )
2024-03-01 15:26:04 -08:00
Jeffrey Morgan
4613a080e7
update llama.cpp submodule to 66c1968f7 ( #2618 )
2024-02-20 17:42:31 -05:00
Daniel Hiltgen
fc39a6cd7a
Fix cuda leaks
This should resolve the problem where we don't fully unload from the GPU
when we go idle.
2024-02-18 18:37:20 -08:00
Jeffrey Morgan
26b13fc33c
patch: always add token to cache_tokens ( #2459 )
2024-02-12 08:10:16 -08:00
Daniel Hiltgen
de76b95dd4
Bump llama.cpp to b2081
2024-02-06 12:06:43 -08:00
Daniel Hiltgen
72b12c3be7
Bump llama.cpp to b1999
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
Jeffrey Morgan
a64570dcae
Fix clearing kv cache between requests with the same prompt ( #2186 )
* Fix clearing kv cache between requests with the same prompt
* fix powershell script
2024-01-25 13:46:20 -08:00