royjhan
b9f5e16c80
Introduce /api/embed
endpoint supporting batch embedding ( #5127 )
...
* Initial Batch Embedding
* Revert "Initial Batch Embedding"
This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.
* Initial Draft
* mock up notes
* api/embed draft
* add server function
* check normalization
* clean up
* normalization
* playing around with truncate stuff
* Truncation
* Truncation
* move normalization to go
* Integration Test Template
* Truncation Integration Tests
* Clean up
* use float32
* move normalize
* move normalize test
* refactoring
* integration float32
* input handling and handler testing
* Refactoring of legacy and new
* clear comments
* merge conflicts
* touches
* embedding type 64
* merge conflicts
* fix hanging on single string
* refactoring
* test values
* set context length
* clean up
* testing clean up
* testing clean up
* remove function closure
* Revert "remove function closure"
This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.
* remove function closure
* remove redundant error check
* clean up
* more clean up
* clean up
2024-07-15 12:14:24 -07:00
Jeffrey Morgan
d8def1ff94
llm: allow gemma 2 to context shift ( #5534 )
2024-07-07 13:41:51 -04:00
Jeffrey Morgan
0e09c380fc
llm: print caching notices in debug only ( #5533 )
2024-07-07 12:38:04 -04:00
Jeffrey Morgan
2cc854f8cb
llm: fix missing dylibs by restoring old build behavior on Linux and macOS ( #5511 )
...
* Revert "fix cmake build (#5505 )"
This reverts commit 4fd5f3526a
.
* llm: fix missing dylibs by restoring old build behavior
* crlf -> lf
2024-07-05 21:48:31 -04:00
Jeffrey Morgan
4fd5f3526a
fix cmake build ( #5505 )
2024-07-05 19:07:01 -04:00
Jeffrey Morgan
8f8e736b13
update llama.cpp submodule to d7fd29f
( #5475 )
2024-07-05 13:25:58 -04:00
Jeffrey Morgan
d89454de80
Use slot with cached prompt instead of least recently used ( #5492 )
...
* Use common prefix to select slot
* actually report `longest`
2024-07-05 12:32:47 -04:00
royjhan
3b5a4a77f3
Return Correct Prompt Eval Count Regardless of Cache Prompt ( #5371 )
...
* openai compatibility
* Revert "openai compatibility"
This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.
* remove erroneous subtraction of prompt cache
2024-07-03 13:46:23 -07:00
Jeffrey Morgan
717f7229eb
Do not shift context for sliding window models ( #5368 )
...
* Do not shift context for sliding window models
* truncate prompt > 2/3 tokens
* only target gemma2
2024-06-28 19:39:31 -07:00
Michael Yang
9d91e5e587
remove confusing log message
2024-06-19 11:14:11 -07:00
Daniel Hiltgen
fb9cdfa723
Fix server.cpp for the new cuda build macros
2024-06-14 14:51:40 -07:00
Jeffrey Morgan
ead259d877
llm: fix seed value not being applied to requests ( #4986 )
2024-06-11 14:24:41 -07:00
Jeffrey Morgan
34f142797a
llm: always add bos token to prompt ( #4941 )
...
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
---------
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
2024-06-08 18:47:10 -07:00
Michael Yang
829ff87bd1
revert tokenize ffi ( #4761 )
...
* Revert "use `int32_t` for call to tokenize (#4738 )"
This reverts commit 763bb65dbb
.
* Revert "vocab only"
This reverts commit bf54c845e9
.
* Revert "use ffi for tokenizing/detokenizing"
This reverts commit 26a00a0410
.
2024-05-31 18:54:21 -07:00
Michael Yang
de781b37c8
rm unused infill
2024-05-29 11:26:47 -07:00
Michael Yang
3e21799377
rm unused system prompt
2024-05-29 11:26:47 -07:00
Michael Yang
26a00a0410
use ffi for tokenizing/detokenizing
2024-05-29 11:26:47 -07:00
Michael Yang
714adb8bd1
bump ( #4597 )
2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12
Wire up load progress
...
This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Sam
e15307fdf4
feat: add support for flash_attn ( #4120 )
...
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Michael Yang
58876091f7
log clean up
2024-05-09 14:55:36 -07:00
Daniel Hiltgen
920a4b0794
Merge remote-tracking branch 'upstream/main' into pr3702
2024-05-08 16:44:35 -07:00
Michael Yang
44869c59d6
omit prompt and generate settings from final response
2024-05-03 17:00:02 -07:00
jmorganca
fcf4d60eee
llm: add back check for empty token cache
2024-04-30 17:38:44 -04:00
Jeffrey Morgan
18d9a7e1f1
update llama.cpp submodule to f364eb6
( #4060 )
2024-04-30 17:25:39 -04:00
Daniel Hiltgen
23d23409a0
Update llama.cpp ( #4036 )
...
* Bump llama.cpp to b2761
* Adjust types for bump
2024-04-29 23:18:48 -04:00
ManniX-ITA
c942e4a07b
Fixed startup sequence to report model loading
2024-04-17 17:40:32 +02:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path ( #3681 )
...
* parse wide argv characters on windows
* cleanup
* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Daniel Hiltgen
0a0e9f3e0f
Apply 01-cache.diff
2024-04-01 16:48:18 -07:00
Daniel Hiltgen
58d95cc9bd
Switch back to subprocessing for llama.cpp
...
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
Jeffrey Morgan
f5ca7f8c8e
add license in file header for vendored llama.cpp code ( #3351 )
2024-03-26 16:23:23 -04:00
Daniel Hiltgen
43799532c1
Bump llama.cpp to b2474
...
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Jeffrey Morgan
e95ffc7448
llama: remove server static assets ( #3174 )
2024-03-15 19:24:12 -07:00
Daniel Hiltgen
85129d3a32
Adapt our build for imported server.cpp
2024-03-12 14:57:15 -07:00
Daniel Hiltgen
9ac6440da3
Import server.cpp as of b2356
2024-03-12 13:58:06 -07:00
racerole
53c107e20e
chore: fix typo ( #3073 )
...
Signed-off-by: racerole <jiangyifeng@outlook.com>
2024-03-12 14:09:22 -04:00
Bruce MacDonald
b80661e8c7
relay load model errors to the client ( #3065 )
2024-03-11 16:48:27 -04:00
Jeffrey Morgan
369eda65f5
update llama.cpp submodule to ceca1ae
( #3064 )
2024-03-11 12:57:48 -07:00
Jeffrey Morgan
1ffb1e2874
update llama.cpp submodule to 77d1ac7
( #3030 )
2024-03-09 15:55:34 -08:00
Jeffrey Morgan
0e4669b04f
update llama.cpp submodule to 6cdabe6
( #2999 )
2024-03-08 00:26:20 -08:00
Jeffrey Morgan
21347e1ed6
update llama.cpp submodule to c29af7e
( #2868 )
2024-03-01 15:26:04 -08:00
Jeffrey Morgan
4613a080e7
update llama.cpp submodule to 66c1968f7
( #2618 )
2024-02-20 17:42:31 -05:00
Taras Tsugrii
01ff2e14db
[nit] Remove unused msg local var. ( #2511 )
2024-02-20 14:02:34 -05:00
Jeffrey Morgan
f7231ad9ad
set shutting_down
to false
once shutdown is complete ( #2484 )
2024-02-13 17:48:41 -08:00
Daniel Hiltgen
6680761596
Shutdown faster
...
Make sure that when a shutdown signal comes, we shutdown quickly instead
of waiting for a potentially long exchange to wrap up.
2024-02-08 22:22:50 -08:00
Daniel Hiltgen
72b12c3be7
Bump llama.cpp to b1999
...
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
Daniel Hiltgen
730dcfcc7a
Refine debug logging for llm
...
This wires up logging in llama.cpp to always go to stderr, and also
turns up logging if OLLAMA_DEBUG is set.
2024-01-22 12:26:49 -08:00
Daniel Hiltgen
ec3764538d
Probe GPUs before backend init
...
Detect potential error scenarios so we can fallback to CPU mode without
hitting asserts.
2024-01-21 15:59:38 -08:00
Daniel Hiltgen
1b249748ab
Add multiple CPU variants for Intel Mac
...
This also refines the build process for the ext_server build.
2024-01-17 15:08:54 -08:00
Jeffrey Morgan
557110d0ba
Disable mmap
with lora layers ( #1985 )
2024-01-13 23:36:31 -05:00