Jeffrey Morgan
08f1e18965
Offload layers to GPU based on new model size estimates ( #1850 )
...
* select layers based on estimated model memory usage
* always account for scratch vram
* dont load +1 layers
* better estmation for graph alloc
* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
* add overhead for cuda memory
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* fix build error on linux
* address comments
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2024-01-08 16:42:00 -05:00
Bruce MacDonald
0b3118e0af
fix: relay request opts to loaded llm prediction ( #1761 )
2024-01-03 12:01:42 -05:00
Daniel Hiltgen
d966b730ac
Switch windows build to fully dynamic
...
Refactor where we store build outputs, and support a fully dynamic loading
model on windows so the base executable has no special dependencies thus
doesn't require a special PATH.
2024-01-02 15:36:16 -08:00
Daniel Hiltgen
7555ea44f8
Revamp the dynamic library shim
...
This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.
This also bumps the ROCm library to version 6 given 5.7 builds don't work
on the latest ROCm library that just shipped.
2023-12-20 14:45:57 -08:00
Daniel Hiltgen
54dbfa4c4a
Carry ggml-metal.metal as payload
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
35934b2e05
Adapted rocm support to cgo based llama.cpp
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
d4cd695759
Add cgo implementation for llama.cpp
...
Run the server.cpp directly inside the Go runtime via cgo
while retaining the LLM Go abstractions.
2023-12-19 09:05:46 -08:00
Bruce MacDonald
811b1f03c8
deprecate ggml
...
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in the case automatic pull fails
Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
2023-12-19 09:05:46 -08:00
Bruce MacDonald
6ee8c80199
restore model load duration on generate response ( #1524 )
...
* restore model load duration on generate response
- set model load duration on generate and chat done response
- calculate createAt time when response created
* remove checkpoints predict opts
* Update routes.go
2023-12-14 12:15:50 -05:00
Bruce MacDonald
3144e2a439
exponential back-off ( #1484 )
2023-12-12 12:33:02 -05:00
Bruce MacDonald
c0960e29b5
retry on concurrent request failure ( #1483 )
...
- remove parallel
2023-12-12 12:14:35 -05:00
Patrick Devine
910e9401d0
Multimodal support ( #1216 )
...
---------
Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>
2023-12-11 13:56:22 -08:00
Jeffrey Morgan
fa2f095bd9
fix model name returned by /api/generate
being different than the model name provided
2023-12-10 11:42:15 -05:00
Jeffrey Morgan
2dd040d04c
do not use --parallel 2
for old runners
2023-12-09 20:17:33 -05:00
Bruce MacDonald
bbe41ce41a
fix: parallel queueing race condition caused silent failure ( #1445 )
...
* fix: queued request failures
- increase parallel requests to 2 to complete queued request, queueing is managed in ollama
* log steam errors
2023-12-09 14:14:02 -05:00
Michael Yang
b9495ea162
load projectors
2023-12-05 14:36:12 -08:00
Bruce MacDonald
195e3d9dbd
chat api endpoint ( #1392 )
2023-12-05 14:57:33 -05:00
Jeffrey Morgan
00d06619a1
Revert "chat api ( #991 )" while context variable is fixed
...
This reverts commit 7a0899d62d
.
2023-12-04 21:16:27 -08:00
Bruce MacDonald
7a0899d62d
chat api ( #991 )
...
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
- context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
2023-12-04 18:01:06 -05:00
Jing Zhang
82b9b329ff
windows CUDA support ( #1262 )
...
* Support cuda build in Windows
* Enable dynamic NumGPU allocation for Windows
2023-11-24 17:16:36 -05:00
Jeffrey Morgan
a3fcecf943
only set main_gpu
if value > 0 is provided
2023-11-20 19:54:04 -05:00
Purinda Gunasekara
be61a81758
main-gpu argument is not getting passed to llamacpp, fixed. ( #1192 )
2023-11-20 10:52:52 -05:00
Jeffrey Morgan
36a3bbf65f
Update llm/llama.go
2023-11-18 21:25:07 -05:00
Bruce MacDonald
43a726149d
fix potentially inaccurate error message
2023-11-18 21:25:07 -05:00
Jeffrey Morgan
41434a7cdc
build intel mac with correct binary and compile flags
2023-11-16 22:14:51 -05:00
Jeffrey Morgan
5cba29b9d6
JSON mode: add `"format" as an api parameter ( #1051 )
...
* add `"format": "json"` as an API parameter
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-11-09 16:44:02 -08:00
Bruce MacDonald
1ae84bc2a2
skip gpu if less than 2GB VRAM are available ( #1059 )
2023-11-09 13:16:16 -08:00
Jeffrey Morgan
c44b619428
remove unused fmt.Println
2023-11-03 17:24:58 -07:00
Jeffrey Morgan
17678b7225
Restore system prompt on requests and default num_keep
to 0
2023-11-03 13:25:25 -07:00
Jeffrey Morgan
2e53704685
default rope params to 0 for new models ( #968 )
2023-11-02 08:41:30 -07:00
Michael Yang
642128b75a
append LD_LIBRARY_PATH
2023-10-31 15:54:49 -07:00
Bruce MacDonald
6d283882b1
catch insufficient permissions nvidia err ( #934 )
2023-10-27 12:42:40 -04:00
Bruce MacDonald
2665f3c28e
offload 75% of available vram to improve stability ( #921 )
2023-10-26 20:49:55 -04:00
Jeffrey Morgan
7ed5a39bc7
simpler check for model loading compatibility errors
2023-10-19 14:50:49 -04:00
Jeffrey Morgan
a7dad24d92
add error for falcon
and starcoder
vocab compatibility ( #844 )
...
add error for falcon and starcoder vocab compatibility
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-10-19 12:18:31 -04:00
Michael Yang
235e43d7f6
Merge pull request #833 from discovertomorrow/leadingspace
...
Fix Issue with Leading Whitespaces in Decoded Context
2023-10-18 13:52:48 -07:00
Arne Müller
730996e530
use TrimPrefix instead of TrimLeft
2023-10-18 22:51:30 +02:00
Arne Müller
ce6197a8e0
removed redundant strings.CutPrefix from Decode
2023-10-18 22:47:20 +02:00
Arne Müller
46b9953f32
use strings.TrimLeft to remove spaces
2023-10-18 22:41:19 +02:00
Bruce MacDonald
565648f3f7
relay CUDA errors to the client ( #825 )
2023-10-18 15:36:56 -04:00
Arne Müller
90c49bed57
moved removal of leading space into Predict
2023-10-18 20:08:26 +02:00
Arne Müller
5dc0cff459
fix whitespace removal
2023-10-18 08:15:27 +02:00
Michael Yang
b36b0b71f8
use cut prefix
2023-10-17 14:01:39 -07:00
Michael Yang
094df37563
remove unused struct
2023-10-17 14:01:38 -07:00
Bruce MacDonald
bd93a94abd
fix MB VRAM log output ( #824 )
2023-10-17 15:35:16 -04:00
Michael Yang
f55bdb6f10
Merge pull request #799 from deichbewohner/jsonmarshaling
...
Fix JSON Marshal Escaping for Special Characters
2023-10-17 08:46:02 -07:00
Michael Yang
2870a9bfc8
Merge pull request #812 from jmorganca/mxyng/fix-format-string
...
fix: wrong format string type
2023-10-17 08:40:49 -07:00
Arne Müller
8fa3f366ad
Removed newline trimming and used buffer directly in POST request.
2023-10-17 08:17:35 +02:00
Michael Yang
fddb303f23
fix: format string wrong type
2023-10-16 16:14:28 -07:00
Michael Yang
cb4a80b693
fix: regression unsupported metal types
...
omitting `--n-gpu-layers` means use metal on macos which isn't correct
since ollama uses `num_gpu=0` to explicitly disable gpu for file types
that are not implemented in metal
2023-10-16 14:37:20 -07:00