Commit graph

63 commits

Author SHA1 Message Date
Daniel Hiltgen
90ca84172c
Fix embeddings memory corruption (#6467)
* Fix embeddings memory corruption

The patch was leading to a buffer overrun corruption.  Once removed though, parallism
in server.cpp lead to hitting an assert due to slot/seq IDs being >= token count.  To
work around this, only use slot 0 for embeddings.

* Fix embed integration test assumption

The token eval count has changed with recent llama.cpp bumps (0.3.5+)
2024-08-22 14:51:42 -07:00
Daniel Hiltgen
74d45f0102 Refactor linux packaging
This adjusts linux to follow a similar model to windows with a discrete archive
(zip/tgz) to cary the primary executable, and dependent libraries. Runners are
still carried as payloads inside the main binary

Darwin retain the payload model where the go binary is fully self contained.
2024-08-19 09:38:53 -07:00
Jeffrey Morgan
15c2d8fe14
server: parallelize embeddings in API web handler instead of in subprocess runner (#6220)
For simplicity, perform parallelization of embedding requests in the API handler instead of offloading this to the subprocess runner. This keeps the scheduling story simpler as it builds on existing parallel requests, similar to existing text completion functionality.
2024-08-11 11:57:10 -07:00
Jeffrey Morgan
e04c7012c2
update llama.cpp submodule to 1e6f6554 (#6208) 2024-08-06 15:11:45 -04:00
royjhan
86b907f82a
sort batch results (#6189) 2024-08-05 16:55:34 -07:00
Michael Yang
6a07344786 line feed 2024-08-04 17:25:41 -07:00
royjhan
1b44d873e7
Add Metrics to api\embed response (#5709)
* add prompt tokens to embed response

* rm slog

* metrics

* types

* prompt n

* clean up

* reset submodule

* update tests

* test name

* list metrics
2024-07-30 13:12:21 -07:00
Jeffrey Morgan
68ee42f995
update llama.cpp submodule to 6eeaeba1 (#6039) 2024-07-29 13:20:26 -07:00
Daniel Hiltgen
e12fff8810 Enable windows error dialog for subprocess startup
Make sure if something goes wrong spawning the process, the user gets
enough info to be able to try to self correct, or at least file a bug
with details so we can fix it.  Once the process starts, we immediately
change back to the recommended setting to prevent the blocking dialog.
This ensures if the model fails to load (OOM, unsupported model type,
etc.) the process will exit quickly and we can scan the stdout/stderr
of the subprocess for the reason to report via API.
2024-07-22 14:07:27 -07:00
royjhan
b9f5e16c80
Introduce /api/embed endpoint supporting batch embedding (#5127)
* Initial Batch Embedding

* Revert "Initial Batch Embedding"

This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.

* Initial Draft

* mock up notes

* api/embed draft

* add server function

* check normalization

* clean up

* normalization

* playing around with truncate stuff

* Truncation

* Truncation

* move normalization to go

* Integration Test Template

* Truncation Integration Tests

* Clean up

* use float32

* move normalize

* move normalize test

* refactoring

* integration float32

* input handling and handler testing

* Refactoring of legacy and new

* clear comments

* merge conflicts

* touches

* embedding type 64

* merge conflicts

* fix hanging on single string

* refactoring

* test values

* set context length

* clean up

* testing clean up

* testing clean up

* remove function closure

* Revert "remove function closure"

This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.

* remove function closure

* remove redundant error check

* clean up

* more clean up

* clean up
2024-07-15 12:14:24 -07:00
Jeffrey Morgan
d8def1ff94
llm: allow gemma 2 to context shift (#5534) 2024-07-07 13:41:51 -04:00
Jeffrey Morgan
0e09c380fc
llm: print caching notices in debug only (#5533) 2024-07-07 12:38:04 -04:00
Jeffrey Morgan
2cc854f8cb
llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511)
* Revert "fix cmake build (#5505)"

This reverts commit 4fd5f3526a.

* llm: fix missing dylibs by restoring old build behavior

* crlf -> lf
2024-07-05 21:48:31 -04:00
Jeffrey Morgan
4fd5f3526a
fix cmake build (#5505) 2024-07-05 19:07:01 -04:00
Jeffrey Morgan
8f8e736b13
update llama.cpp submodule to d7fd29f (#5475) 2024-07-05 13:25:58 -04:00
Jeffrey Morgan
d89454de80
Use slot with cached prompt instead of least recently used (#5492)
* Use common prefix to select slot

* actually report `longest`
2024-07-05 12:32:47 -04:00
royjhan
3b5a4a77f3
Return Correct Prompt Eval Count Regardless of Cache Prompt (#5371)
* openai compatibility

* Revert "openai compatibility"

This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.

* remove erroneous subtraction of prompt cache
2024-07-03 13:46:23 -07:00
Jeffrey Morgan
717f7229eb
Do not shift context for sliding window models (#5368)
* Do not shift context for sliding window models

* truncate prompt > 2/3 tokens

* only target gemma2
2024-06-28 19:39:31 -07:00
Michael Yang
9d91e5e587 remove confusing log message 2024-06-19 11:14:11 -07:00
Daniel Hiltgen
fb9cdfa723 Fix server.cpp for the new cuda build macros 2024-06-14 14:51:40 -07:00
Jeffrey Morgan
ead259d877
llm: fix seed value not being applied to requests (#4986) 2024-06-11 14:24:41 -07:00
Jeffrey Morgan
34f142797a
llm: always add bos token to prompt (#4941)
* fix embedding by adding fixes from llama.cpp upstream

* remove assert

---------

Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
2024-06-08 18:47:10 -07:00
Michael Yang
829ff87bd1
revert tokenize ffi (#4761)
* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65dbb.

* Revert "vocab only"

This reverts commit bf54c845e9.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a0410.
2024-05-31 18:54:21 -07:00
Michael Yang
de781b37c8 rm unused infill 2024-05-29 11:26:47 -07:00
Michael Yang
3e21799377 rm unused system prompt 2024-05-29 11:26:47 -07:00
Michael Yang
26a00a0410 use ffi for tokenizing/detokenizing 2024-05-29 11:26:47 -07:00
Michael Yang
714adb8bd1
bump (#4597) 2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Sam
e15307fdf4
feat: add support for flash_attn (#4120)
* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Michael Yang
58876091f7 log clean up 2024-05-09 14:55:36 -07:00
Daniel Hiltgen
920a4b0794 Merge remote-tracking branch 'upstream/main' into pr3702 2024-05-08 16:44:35 -07:00
Michael Yang
44869c59d6 omit prompt and generate settings from final response 2024-05-03 17:00:02 -07:00
jmorganca
fcf4d60eee llm: add back check for empty token cache 2024-04-30 17:38:44 -04:00
Jeffrey Morgan
18d9a7e1f1
update llama.cpp submodule to f364eb6 (#4060) 2024-04-30 17:25:39 -04:00
Daniel Hiltgen
23d23409a0
Update llama.cpp (#4036)
* Bump llama.cpp to b2761

* Adjust types for bump
2024-04-29 23:18:48 -04:00
ManniX-ITA
c942e4a07b
Fixed startup sequence to report model loading 2024-04-17 17:40:32 +02:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path (#3681)
* parse wide argv characters on windows

* cleanup

* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Daniel Hiltgen
0a0e9f3e0f Apply 01-cache.diff 2024-04-01 16:48:18 -07:00
Daniel Hiltgen
58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems.  This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
Jeffrey Morgan
f5ca7f8c8e
add license in file header for vendored llama.cpp code (#3351) 2024-03-26 16:23:23 -04:00
Daniel Hiltgen
43799532c1 Bump llama.cpp to b2474
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Jeffrey Morgan
e95ffc7448
llama: remove server static assets (#3174) 2024-03-15 19:24:12 -07:00
Daniel Hiltgen
85129d3a32 Adapt our build for imported server.cpp 2024-03-12 14:57:15 -07:00
Daniel Hiltgen
9ac6440da3 Import server.cpp as of b2356 2024-03-12 13:58:06 -07:00
racerole
53c107e20e
chore: fix typo (#3073)
Signed-off-by: racerole <jiangyifeng@outlook.com>
2024-03-12 14:09:22 -04:00
Bruce MacDonald
b80661e8c7
relay load model errors to the client (#3065) 2024-03-11 16:48:27 -04:00
Jeffrey Morgan
369eda65f5
update llama.cpp submodule to ceca1ae (#3064) 2024-03-11 12:57:48 -07:00
Jeffrey Morgan
1ffb1e2874
update llama.cpp submodule to 77d1ac7 (#3030) 2024-03-09 15:55:34 -08:00
Jeffrey Morgan
0e4669b04f
update llama.cpp submodule to 6cdabe6 (#2999) 2024-03-08 00:26:20 -08:00
Jeffrey Morgan
21347e1ed6
update llama.cpp submodule to c29af7e (#2868) 2024-03-01 15:26:04 -08:00