Jeffrey Morgan
717f7229eb
Do not shift context for sliding window models ( #5368 )
...
* Do not shift context for sliding window models
* truncate prompt > 2/3 tokens
* only target gemma2
2024-06-28 19:39:31 -07:00
Michael Yang
de2163dafd
gemma2 graph
2024-06-27 13:34:52 -07:00
Jeffrey Morgan
4d311eb731
llm: architecture patch ( #5316 )
2024-06-26 21:38:12 -07:00
Blake Mizerany
cb42e607c5
llm: speed up gguf decoding by a lot ( #5246 )
...
Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in a
not-okay amount of syscalls/disk I/O.
The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.
This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.
Also, this fixes a broken test that was not encoding valid GGUF.
2024-06-24 21:47:52 -07:00
Daniel Hiltgen
5bf5aeec01
Refine mmap default logic on linux
...
If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.
2024-06-20 11:07:04 -07:00
Michael Yang
8e0641a9bf
handle asymmetric embedding KVs
2024-06-20 09:57:27 -07:00
Michael Yang
9d91e5e587
remove confusing log message
2024-06-19 11:14:11 -07:00
Daniel Hiltgen
96624aa412
Merge pull request #5072 from dhiltgen/windows_path
...
Move libraries out of users path
2024-06-19 09:13:39 -07:00
Michael Yang
e873841cbb
deepseek v2 graph
2024-06-18 15:35:12 -07:00
Daniel Hiltgen
359b15a597
Handle models with divergent layer sizes
...
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
2024-06-18 11:05:34 -07:00
Daniel Hiltgen
7784ca33ce
Tighten up memory prediction logging
...
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
2024-06-18 09:15:35 -07:00
Daniel Hiltgen
171796791f
Adjust mmap logic for cuda windows for faster model load
...
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off. This also implements a tri-state for
use_mmap so we can detect the difference between a user provided
value of true/false, or unspecified.
2024-06-17 16:54:30 -07:00
Daniel Hiltgen
b0930626c5
Add back lower level parallel flags
...
nvcc supports parallelism (threads) and cmake + make can use -j,
while msbuild requires /p:CL_MPcount=8
2024-06-17 13:44:46 -07:00
Daniel Hiltgen
e890be4814
Revert "More parallelism on windows generate"
...
This reverts commit 0577af98f4
.
2024-06-17 13:32:46 -07:00
Daniel Hiltgen
b2799f111b
Move libraries out of users path
...
We update the PATH on windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
2024-06-17 13:12:18 -07:00
Jeffrey Morgan
152fc202f5
llm: update llama.cpp commit to 7c26775
( #4896 )
...
* llm: update llama.cpp submodule to `7c26775`
* disable `LLAMA_BLAS` for now
* `-DLLAMA_OPENMP=off`
2024-06-17 15:56:16 -04:00
Daniel Hiltgen
4b0050cf0e
Merge pull request #5037 from dhiltgen/faster_win_build
...
More parallelism on windows generate
2024-06-15 08:03:05 -07:00
Daniel Hiltgen
0577af98f4
More parallelism on windows generate
...
Make the build faster
2024-06-15 07:44:55 -07:00
Daniel Hiltgen
da3bf23354
Workaround gfx900 SDMA bugs
...
Implement support for GPU env var workarounds, and leverage
this for the Vega RX 56 which needs
HSA_ENABLE_SDMA=0 set to work properly
2024-06-14 15:38:13 -07:00
Daniel Hiltgen
17df6520c8
Remove mmap related output calc logic
2024-06-14 14:55:50 -07:00
Daniel Hiltgen
6f351bf586
review comments and coverage
2024-06-14 14:55:50 -07:00
Daniel Hiltgen
fc37c192ae
Refine CPU load behavior with system memory visibility
2024-06-14 14:51:40 -07:00
Daniel Hiltgen
6fd04ca922
Improve multi-gpu handling at the limit
...
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
2024-06-14 14:51:40 -07:00
Daniel Hiltgen
fb9cdfa723
Fix server.cpp for the new cuda build macros
2024-06-14 14:51:40 -07:00
Michael Yang
217f60c3d9
Merge pull request #4987 from ollama/mxyng/revert-byte-order
...
Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order"
2024-06-11 16:04:20 -07:00
Michael Yang
7bdcd1da94
Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order"
...
This reverts commit f5f245cc15
, reversing
changes made to 94d37fdcae
.
this change broke gguf v2 which is incorrectly detected as big endian
2024-06-11 15:56:17 -07:00
Jeffrey Morgan
ead259d877
llm: fix seed value not being applied to requests ( #4986 )
2024-06-11 14:24:41 -07:00
Michael Yang
f5f245cc15
Merge pull request #4938 from ollama/mxyng/fix-byte-order
...
fix parsing big endian gguf
2024-06-10 09:38:12 -07:00
Craig Hughes
b84aea1685
Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. ( #3782 )
2024-06-09 10:57:09 -07:00
Jeffrey Morgan
34f142797a
llm: always add bos token to prompt ( #4941 )
...
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
---------
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
2024-06-08 18:47:10 -07:00
Michael Yang
620d5c569e
fix parsing big endian gguf
2024-06-08 12:35:26 -07:00
Daniel Hiltgen
cddc63381c
Merge pull request #4909 from dhiltgen/oneapi_disable
...
Add ability to skip oneapi generate
2024-06-07 14:07:15 -07:00
Michael Yang
030e765e76
fix create model when template detection errors
2024-06-07 10:51:35 -07:00
Daniel Hiltgen
ab8c929e20
Add ability to skip oneapi generate
...
This follows the same pattern for cuda and rocm to allow
disabling the build even when we detect the dependent libraries
2024-06-07 08:32:49 -07:00
Jeffrey Morgan
ce0dc33cb8
llm: patch to fix qwen 2 temporarily on nvidia ( #4897 )
2024-06-06 23:14:33 -07:00
Michael Yang
9b6c2e6eb6
detect chat template from KV
2024-06-06 16:03:47 -07:00
Michael Yang
6297f85606
gofmt, goimports
2024-06-04 13:20:24 -07:00
Michael Yang
e40145a39d
lint
2024-06-04 11:13:30 -07:00
Michael Yang
c895a7d13f
some gocritic
2024-06-04 11:13:30 -07:00
Michael Yang
04f3c12bb7
replace x/exp/slices with slices
2024-06-04 11:13:30 -07:00
Michael Yang
829ff87bd1
revert tokenize ffi ( #4761 )
...
* Revert "use `int32_t` for call to tokenize (#4738 )"
This reverts commit 763bb65dbb
.
* Revert "vocab only"
This reverts commit bf54c845e9
.
* Revert "use ffi for tokenizing/detokenizing"
This reverts commit 26a00a0410
.
2024-05-31 18:54:21 -07:00
Jeffrey Morgan
763bb65dbb
use int32_t
for call to tokenize ( #4738 )
...
* use `int32_t` for call to tokenize
* variable naming
* cleanup
* fix crash
2024-05-30 21:43:30 -07:00
Jeffrey Morgan
7ca9605f54
speed up tests by only building static lib ( #4740 )
2024-05-30 21:43:15 -07:00
Michael Yang
eb2c443a79
Merge pull request #4736 from ollama/mxyng/vocab-only
...
vocab only for tokenize
2024-05-30 17:21:00 -07:00
Jeffrey Morgan
a50a87a7b8
partial offloading: allow flash attention and disable mmap ( #4734 )
...
* partial offloading: allow flash attention and disable mmap
* allow mmap with num_gpu=0
2024-05-30 16:58:01 -07:00
Michael Yang
bf54c845e9
vocab only
2024-05-30 16:49:28 -07:00
Jeffrey Morgan
22f5c12ced
Update llama.cpp submodule to 5921b8f0
( #4731 )
...
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`
* add patch
2024-05-30 16:20:22 -07:00
Michael Yang
de781b37c8
rm unused infill
2024-05-29 11:26:47 -07:00
Michael Yang
3e21799377
rm unused system prompt
2024-05-29 11:26:47 -07:00
Michael Yang
26a00a0410
use ffi for tokenizing/detokenizing
2024-05-29 11:26:47 -07:00