royjhan
b9f5e16c80
Introduce /api/embed endpoint supporting batch embedding ( #5127 )
...
* Initial Batch Embedding
* Revert "Initial Batch Embedding"
This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.
* Initial Draft
* mock up notes
* api/embed draft
* add server function
* check normalization
* clean up
* normalization
* playing around with truncate stuff
* Truncation
* Truncation
* move normalization to go
* Integration Test Template
* Truncation Integration Tests
* Clean up
* use float32
* move normalize
* move normalize test
* refactoring
* integration float32
* input handling and handler testing
* Refactoring of legacy and new
* clear comments
* merge conflicts
* touches
* embedding type 64
* merge conflicts
* fix hanging on single string
* refactoring
* test values
* set context length
* clean up
* testing clean up
* testing clean up
* remove function closure
* Revert "remove function closure"
This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.
* remove function closure
* remove redundant error check
* clean up
* more clean up
* clean up
2024-07-15 12:14:24 -07:00
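To make the new endpoint concrete, below is a minimal sketch of a batch call to /api/embed followed by L2 normalization of the returned vectors, echoing the normalization work listed in the squashed commits above. The field names ("model", "input", "embeddings"), the example model, and the client code itself are illustrative assumptions, not taken from this log.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math"
	"net/http"
)

// embedRequest/embedResponse assume the batch shape the commit describes:
// one "input" array in, one "embeddings" array out.
type embedRequest struct {
	Model string   `json:"model"`
	Input []string `json:"input"`
}

type embedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
}

// normalize scales a vector to unit length, a sketch of the normalization
// the commits above describe moving into Go.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := float32(math.Sqrt(sum))
	if norm == 0 {
		return v
	}
	out := make([]float32, len(v))
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

func main() {
	body, _ := json.Marshal(embedRequest{
		Model: "all-minilm",
		Input: []string{"why is the sky blue?", "why is grass green?"},
	})
	resp, err := http.Post("http://localhost:11434/api/embed", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	for i, e := range out.Embeddings {
		fmt.Printf("embedding %d: %d dims\n", i, len(normalize(e)))
	}
}
```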
Jeffrey Morgan
d8def1ff94
llm: allow gemma 2 to context shift ( #5534 )
2024-07-07 13:41:51 -04:00
Jeffrey Morgan
0e09c380fc
llm: print caching notices in debug only ( #5533 )
2024-07-07 12:38:04 -04:00
Jeffrey Morgan
d89454de80
Use slot with cached prompt instead of least recently used ( #5492 )
...
* Use common prefix to select slot
* actually report `longest`
2024-07-05 12:32:47 -04:00
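As a rough illustration of the prefix-based selection in #5492, here is a minimal sketch with hypothetical slot and token types; the real change lives in the vendored server code.

```go
package main

import "fmt"

// slot is a hypothetical runner slot holding the tokens cached from its last prompt.
type slot struct {
	id     int
	cached []int
}

// commonPrefix returns how many leading tokens a and b share.
func commonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// pickSlot chooses the slot whose cache shares the longest prefix with the
// incoming prompt and reports that length (the `longest` value mentioned in
// the commit body), instead of falling back to the least recently used slot.
func pickSlot(slots []slot, prompt []int) (best, longest int) {
	for i, s := range slots {
		if n := commonPrefix(s.cached, prompt); n > longest {
			best, longest = i, n
		}
	}
	return best, longest
}

func main() {
	slots := []slot{
		{id: 0, cached: []int{1, 2, 3}},
		{id: 1, cached: []int{1, 2, 9, 9}},
	}
	best, longest := pickSlot(slots, []int{1, 2, 9, 4})
	fmt.Printf("slot %d, prefix %d\n", slots[best].id, longest)
}
```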
royjhan
3b5a4a77f3
Return Correct Prompt Eval Count Regardless of Cache Prompt ( #5371 )
...
* openai compatibility
* Revert "openai compatibility"
This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.
* remove erroneous subtraction of prompt cache
2024-07-03 13:46:23 -07:00
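A tiny sketch of the fix in #5371, with hypothetical names: the reported prompt eval count is the full prompt length, with no subtraction for tokens served from the prompt cache.

```go
package main

import "fmt"

// countPromptEval reports the prompt token count regardless of cache state.
// Names are placeholders for illustration.
func countPromptEval(promptTokens, cachedTokens []int) int {
	// before the fix (roughly): len(promptTokens) - len(cachedTokens)
	_ = cachedTokens
	return len(promptTokens)
}

func main() {
	fmt.Println(countPromptEval(make([]int, 10), make([]int, 4))) // 10, not 6
}
```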
Jeffrey Morgan
717f7229eb
Do not shift context for sliding window models ( #5368 )
...
* Do not shift context for sliding window models
* truncate prompt > 2/3 tokens
* only target gemma2
2024-06-28 19:39:31 -07:00
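A rough sketch of the behavior described above, under the assumption that the truncated prompt keeps its most recent tokens; the real logic targets gemma2 only and skips context shifting for it.

```go
package main

import "fmt"

// truncateForSlidingWindow truncates over-long prompts to roughly 2/3 of the
// context window for sliding-window models instead of shifting context.
// Keeping the tail of the prompt is an assumption made for this sketch.
func truncateForSlidingWindow(arch string, prompt []int, numCtx int) []int {
	if arch != "gemma2" {
		return prompt
	}
	limit := numCtx * 2 / 3
	if len(prompt) <= limit {
		return prompt
	}
	return prompt[len(prompt)-limit:]
}

func main() {
	prompt := make([]int, 100)
	fmt.Println(len(truncateForSlidingWindow("gemma2", prompt, 90))) // 60
}
```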
Michael Yang
9d91e5e587
remove confusing log message
2024-06-19 11:14:11 -07:00
Daniel Hiltgen
fb9cdfa723
Fix server.cpp for the new cuda build macros
2024-06-14 14:51:40 -07:00
Jeffrey Morgan
ead259d877
llm: fix seed value not being applied to requests ( #4986 )
2024-06-11 14:24:41 -07:00
Jeffrey Morgan
34f142797a
llm: always add bos token to prompt ( #4941 )
...
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
---------
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
2024-06-08 18:47:10 -07:00
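A minimal sketch of the idea in #4941, with placeholder token values: ensure the tokenized prompt begins with the model's BOS token so embeddings behave like upstream llama.cpp.

```go
package main

import "fmt"

// ensureBOS prepends the BOS token when the prompt does not already start with it.
// Token values are placeholders; the real change lands in the vendored server.
func ensureBOS(tokens []int, bos int) []int {
	if len(tokens) > 0 && tokens[0] == bos {
		return tokens
	}
	return append([]int{bos}, tokens...)
}

func main() {
	fmt.Println(ensureBOS([]int{42, 7}, 1)) // [1 42 7]
}
```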
Michael Yang
829ff87bd1
revert tokenize ffi ( #4761 )
...
* Revert "use `int32_t` for call to tokenize (#4738 )"
This reverts commit 763bb65dbb
.
* Revert "vocab only"
This reverts commit bf54c845e9
.
* Revert "use ffi for tokenizing/detokenizing"
This reverts commit 26a00a0410
.
2024-05-31 18:54:21 -07:00
Michael Yang
de781b37c8
rm unused infill
2024-05-29 11:26:47 -07:00
Michael Yang
3e21799377
rm unused system prompt
2024-05-29 11:26:47 -07:00
Michael Yang
26a00a0410
use ffi for tokenizing/detokenizing
2024-05-29 11:26:47 -07:00
Michael Yang
714adb8bd1
bump ( #4597 )
2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12
Wire up load progress
...
This doesn't expose any UX yet, but wires up the initial server portion
of progress reporting during load.
2024-05-23 13:36:48 -07:00
Sam
e15307fdf4
feat: add support for flash_attn ( #4120 )
...
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
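A minimal sketch of opting the runner into flash attention when it is supported; OLLAMA_FLASH_ATTENTION is assumed as the user-facing switch, and the exact flag handed to the subprocess is illustrative, not taken from this log.

```go
package main

import (
	"fmt"
	"os"
)

// flashAttnArgs returns extra runner arguments when flash attention is
// requested. The env var name and "--flash-attn" flag are assumptions.
func flashAttnArgs() []string {
	switch os.Getenv("OLLAMA_FLASH_ATTENTION") {
	case "1", "true":
		return []string{"--flash-attn"}
	}
	return nil
}

func main() {
	fmt.Println(flashAttnArgs())
}
```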
Michael Yang
58876091f7
log clean up
2024-05-09 14:55:36 -07:00
Daniel Hiltgen
920a4b0794
Merge remote-tracking branch 'upstream/main' into pr3702
2024-05-08 16:44:35 -07:00
Michael Yang
44869c59d6
omit prompt and generate settings from final response
2024-05-03 17:00:02 -07:00
jmorganca
fcf4d60eee
llm: add back check for empty token cache
2024-04-30 17:38:44 -04:00
Jeffrey Morgan
18d9a7e1f1
update llama.cpp submodule to f364eb6 ( #4060 )
2024-04-30 17:25:39 -04:00
Daniel Hiltgen
23d23409a0
Update llama.cpp ( #4036 )
...
* Bump llama.cpp to b2761
* Adjust types for bump
2024-04-29 23:18:48 -04:00
ManniX-ITA
c942e4a07b
Fixed startup sequence to report model loading
2024-04-17 17:40:32 +02:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path ( #3681 )
...
* parse wide argv characters on windows
* cleanup
* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Daniel Hiltgen
0a0e9f3e0f
Apply 01-cache.diff
2024-04-01 16:48:18 -07:00
Daniel Hiltgen
58d95cc9bd
Switch back to subprocessing for llama.cpp
...
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process, shut it down when idle, and
gracefully restart it if it has problems. This also serves as a first step
toward running multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
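A minimal sketch of that process model, with a placeholder binary path and arguments: start the runner as a child process, let it exit cleanly when idle, and restart it after an unexpected exit.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for {
		// Placeholder path/flags; the real server manages its own runner binary.
		cmd := exec.Command("./llama-server", "--port", "8081")
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		err := cmd.Wait()
		if err == nil {
			return // clean shutdown (e.g. idle), nothing to restart
		}
		log.Printf("runner exited: %v; restarting", err)
		time.Sleep(time.Second) // brief backoff before restarting
	}
}
```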
Jeffrey Morgan
f5ca7f8c8e
add license in file header for vendored llama.cpp code ( #3351 )
2024-03-26 16:23:23 -04:00
Daniel Hiltgen
43799532c1
Bump llama.cpp to b2474
...
The release just before the ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Jeffrey Morgan
e95ffc7448
llama: remove server static assets ( #3174 )
2024-03-15 19:24:12 -07:00
Daniel Hiltgen
85129d3a32
Adapt our build for imported server.cpp
2024-03-12 14:57:15 -07:00
Daniel Hiltgen
9ac6440da3
Import server.cpp as of b2356
2024-03-12 13:58:06 -07:00