Daniel Hiltgen
7555ea44f8
Revamp the dynamic library shim
...
This switches the default llama.cpp build to be CPU-based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.
This also bumps the ROCm library to version 6, since 5.7 builds don't work
with the latest ROCm release that just shipped.
2023-12-20 14:45:57 -08:00
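A minimal sketch of the runtime selection this describes, in Go (the probe paths and library names are hypothetical stand-ins, not the shipped shim):

```go
package main

import (
	"fmt"
	"os"
)

// pickVariant chooses which dynamically loaded llama.cpp build to use at
// runtime. Detection here is a stand-in: real code would probe the driver.
func pickVariant() string {
	if _, err := os.Stat("/usr/lib/x86_64-linux-gnu/libcuda.so"); err == nil {
		return "llama_cuda.so" // hypothetical CUDA variant
	}
	if _, err := os.Stat("/opt/rocm"); err == nil {
		return "llama_rocm.so" // hypothetical ROCm (v6) variant
	}
	return "" // no GPU found: fall back to the built-in CPU llama.cpp
}

func main() {
	if lib := pickVariant(); lib != "" {
		fmt.Println("loading GPU variant:", lib)
		// a real shim would dlopen() this library and bind its symbols
	} else {
		fmt.Println("using default CPU build")
	}
}
```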
Daniel Hiltgen
6558f94ed0
Fix darwin intel build
2023-12-19 13:32:24 -08:00
Daniel Hiltgen
54dbfa4c4a
Carry ggml-metal.metal as payload
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
3269535a4c
Refine handling of shim presence
...
This allows the CPU-only builds to work on systems with Radeon cards
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
1b991d0ba9
Refine build to support CPU only
...
If someone checks out the ollama repo and doesn't install the CUDA
library, this will ensure they can still build a CPU-only version
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
9adca7f711
Bump llama.cpp to b1662 and set n_parallel=1
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
89bbaafa64
Build linux using ubuntu 20.04
...
This changes the container-based linux build to use an older Ubuntu
distro to improve our compatibility matrix for older user machines
2023-12-19 09:05:46 -08:00
Daniel Hiltgen
35934b2e05
Adapted ROCm support to cgo-based llama.cpp
2023-12-19 09:05:46 -08:00
65a
f8ef4439e9
Use build tags to generate accelerated binaries for CUDA and ROCm on Linux.
...
The build tags rocm or cuda must be specified to both go generate and go build.
ROCm builds need ROCM_PATH set (with the ROCm SDK present) as well as
CLBlast installed (for GGML) and CLBlast_DIR set in the environment to the
CLBlast cmake directory (likely /usr/lib/cmake/CLBlast). Build tags are also
used to switch VRAM detection between the cuda and rocm implementations, using
added "accelerator_foo.go" files which contain architecture-specific functions
and variables. accelerator_none is used when no tags are set, and a helper
function addRunner will ignore it if it is the chosen accelerator. Fix go
generate commands; thanks @deadmeu for testing.
2023-12-19 09:05:46 -08:00
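In miniature, the build-tag layout described above might look like the following sketch (only the accelerator_*.go naming and the addRunner behavior come from the commit; the vramBytes helper and its values are assumptions):

```go
// accelerator_cuda.go — compiled only when building with -tags cuda
//go:build cuda

package llm

const acceleratorName = "cuda"

// vramBytes is hypothetical; a real version would query NVML or nvidia-smi.
func vramBytes() (uint64, error) { return 8 << 30, nil }
```

```go
// accelerator_none.go — fallback when no accelerator tag is set; a helper
// like addRunner can skip registering a GPU runner when it sees this.
//go:build !cuda && !rocm

package llm

const acceleratorName = "none"

func vramBytes() (uint64, error) { return 0, nil }
```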
Daniel Hiltgen
d4cd695759
Add cgo implementation for llama.cpp
...
Run the server.cpp directly inside the Go runtime via cgo
while retaining the LLM Go abstractions.
2023-12-19 09:05:46 -08:00
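A toy illustration of the cgo approach, not the actual server.cpp binding:

```go
package main

/*
// stand-in for the real llama.cpp entry points that the commit binds via cgo
static int toy_infer(int n) { return n * 2; }
*/
import "C"

import "fmt"

func main() {
	// the C function runs inside the same process as the Go runtime,
	// which is the point of the cgo approach: no separate server process
	fmt.Println(C.toy_infer(21)) // 42
}
```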
Bruce MacDonald
811b1f03c8
deprecate ggml
...
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in case the automatic pull fails
Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
2023-12-19 09:05:46 -08:00
Jeffrey Morgan
6b5bdfa6c9
update runner submodule
2023-12-18 17:33:46 -05:00
Jeffrey Morgan
c063ee4af0
update runner submodule to fix hipblas build
2023-12-18 15:41:13 -05:00
Jeffrey Morgan
b85982eb91
update runner submodule
2023-12-18 12:43:31 -05:00
Bruce MacDonald
6ee8c80199
restore model load duration on generate response ( #1524 )
...
* restore model load duration on generate response
- set model load duration on generate and chat done response
- calculate createdAt time when the response is created
* remove checkpoints predict opts
* Update routes.go
2023-12-14 12:15:50 -05:00
Jeffrey Morgan
31f0551dab
Update runner to support mixtral and mixture of experts (MoE) ( #1475 )
2023-12-13 17:15:10 -05:00
Michael Yang
4251b342de
Merge pull request #1469 from jmorganca/mxyng/model-types
...
remove per-model types
2023-12-12 12:27:03 -08:00
Bruce MacDonald
3144e2a439
exponential back-off ( #1484 )
2023-12-12 12:33:02 -05:00
Bruce MacDonald
c0960e29b5
retry on concurrent request failure ( #1483 )
...
- remove parallel
2023-12-12 12:14:35 -05:00
Patrick Devine
910e9401d0
Multimodal support ( #1216 )
...
---------
Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>
2023-12-11 13:56:22 -08:00
Michael Yang
56ffc3023a
remove per-model types
...
mostly replaced by decoding tensors, except for ggml models, which only
support llama
2023-12-11 09:40:21 -08:00
Jeffrey Morgan
fa2f095bd9
fix model name returned by /api/generate being different than the model name provided
2023-12-10 11:42:15 -05:00
Jeffrey Morgan
d9a250e9b5
seek to end of file when decoding older model formats
2023-12-09 21:14:35 -05:00
Jeffrey Morgan
944519ed16
seek to eof for older model binaries
2023-12-09 20:48:57 -05:00
Jeffrey Morgan
2dd040d04c
do not use --parallel 2 for old runners
2023-12-09 20:17:33 -05:00
Bruce MacDonald
bbe41ce41a
fix: parallel queueing race condition that caused silent failures ( #1445 )
...
* fix: queued request failures
- increase parallel requests to 2 to complete queued requests; queueing is managed in ollama
* log stream errors
2023-12-09 14:14:02 -05:00
Michael Yang
f1b049fed8
Merge pull request #1377 from jmorganca/mxyng/qwen
...
update for qwen
2023-12-06 12:31:51 -08:00
Michael Yang
b9495ea162
load projectors
2023-12-05 14:36:12 -08:00
Michael Yang
409bb9674e
Merge pull request #1308 from jmorganca/mxyng/split-from
...
split from into one or more models
2023-12-05 14:33:03 -08:00
Michael Yang
d3479c07a1
Merge pull request #1250 from jmorganca/mxyng/create-layer
...
refactor layer creation
2023-12-05 14:32:52 -08:00
Bruce MacDonald
195e3d9dbd
chat api endpoint ( #1392 )
2023-12-05 14:57:33 -05:00
Jeffrey Morgan
00d06619a1
Revert "chat api ( #991 )" while context variable is fixed
...
This reverts commit 7a0899d62d.
2023-12-04 21:16:27 -08:00
Michael Yang
5a5dca13b2
comments
2023-12-04 16:59:23 -08:00
Michael Yang
72e7a49aa9
seek instead of copyn
2023-12-04 16:59:23 -08:00
Michael Yang
2cb0fa7d40
split from into one or more models
2023-12-04 16:59:23 -08:00
Michael Yang
b2816bca67
unnecessary ReadSeeker for DecodeGGML
2023-12-04 16:59:23 -08:00
Bruce MacDonald
7a0899d62d
chat api ( #991 )
...
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
- context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
2023-12-04 18:01:06 -05:00
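A sketch of calling the new messages endpoint with Go's standard library, assuming a local ollama server on the default port:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// the chat endpoint takes a model name and a list of role/content messages
	body := []byte(`{
	  "model": "llama2",
	  "messages": [{"role": "user", "content": "why is the sky blue?"}]
	}`)
	resp, err := http.Post("http://localhost:11434/api/chat",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // streamed JSON objects, one per line
}
```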
Michael Yang
6deebf2489
update for qwen
2023-12-04 11:38:05 -08:00
Jeffrey Morgan
16a9006306
add back f16c instructions on intel mac
2023-11-26 15:59:49 -05:00
Jeffrey Morgan
9e4a316405
update submodule commit
2023-11-26 14:52:00 -05:00
Jing Zhang
82b9b329ff
windows CUDA support ( #1262 )
...
* Support cuda build in Windows
* Enable dynamic NumGPU allocation for Windows
2023-11-24 17:16:36 -05:00
Jongwook Choi
12e8c12d2b
Disable CUDA peer access as a workaround for multi-gpu inference bug ( #1261 )
...
When CUDA peer access is enabled, multi-gpu inference will produce
garbage output. This is a known bug of llama.cpp (or nvidia). Until the
upstream bug is fixed, we can disable CUDA peer access temporarily
to ensure correct output.
See #961.
2023-11-24 14:05:57 -05:00
Jeffrey Morgan
d77dde126b
consistent cpu instructions on macos and linux
2023-11-22 16:26:46 -05:00
Michael Yang
199941cd15
fix: gguf int type
2023-11-22 11:40:30 -08:00
Michael Yang
a00fac4ec8
update llama.cpp
2023-11-21 09:50:02 -08:00
Jeffrey Morgan
a3fcecf943
only set main_gpu if value > 0 is provided
2023-11-20 19:54:04 -05:00
Michael Yang
19b7a4d715
recent llama.cpp update added kernels for fp32, q5_0, and q5_1
2023-11-20 13:44:31 -08:00
Purinda Gunasekara
be61a81758
fix main-gpu argument not getting passed to llama.cpp ( #1192 )
2023-11-20 10:52:52 -05:00
Jeffrey Morgan
13ba6df5ab
enable cpu instructions on intel macs
2023-11-19 23:20:26 -05:00
Jeffrey Morgan
36a3bbf65f
Update llm/llama.go
2023-11-18 21:25:07 -05:00
Bruce MacDonald
43a726149d
fix potentially inaccurate error message
2023-11-18 21:25:07 -05:00
Jeffrey Morgan
41434a7cdc
build intel mac with correct binary and compile flags
2023-11-16 22:14:51 -05:00
Jeffrey Morgan
5cba29b9d6
JSON mode: add `"format": "json"` as an API parameter ( #1051 )
...
* add `"format": "json"` as an API parameter
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-11-09 16:44:02 -08:00
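The same POST pattern works for /api/generate with JSON mode enabled; a sketch assuming a local server:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// "format": "json" constrains the model's output to valid JSON
	body := []byte(`{
	  "model": "llama2",
	  "prompt": "List three primary colors as a JSON array.",
	  "format": "json"
	}`)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```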
Bruce MacDonald
1ae84bc2a2
skip gpu if less than 2GB VRAM are available ( #1059 )
2023-11-09 13:16:16 -08:00
Michael Yang
c5e1bbabda
instead of static number of parameters for each model family, get the real number from the tensors ( #1022 )
...
* parse tensor info
* refactor decoder
* return actual parameter count
* explicit rounding
* s/Human/HumanNumber/
2023-11-08 17:55:46 -08:00
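Conceptually, the real parameter count falls out of the tensor metadata; a sketch with hypothetical types, not the repo's decoder:

```go
package main

import "fmt"

// tensorInfo is a hypothetical stand-in for decoded GGUF tensor metadata.
type tensorInfo struct {
	name string
	dims []uint64
}

// parameterCount sums the element counts of all tensors, giving the model's
// true parameter count rather than a static per-family number.
func parameterCount(tensors []tensorInfo) uint64 {
	var total uint64
	for _, t := range tensors {
		n := uint64(1)
		for _, d := range t.dims {
			n *= d
		}
		total += n
	}
	return total
}

func main() {
	tensors := []tensorInfo{
		{"tok_embeddings.weight", []uint64{4096, 32000}},
		{"output.weight", []uint64{4096, 32000}},
	}
	fmt.Println(parameterCount(tensors)) // 262144000
}
```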
Jeffrey Morgan
c44b619428
remove unused fmt.Println
2023-11-03 17:24:58 -07:00
Jeffrey Morgan
17678b7225
Restore system prompt on requests and default num_keep to 0
2023-11-03 13:25:25 -07:00
Jeffrey Morgan
2e53704685
default rope params to 0 for new models ( #968 )
2023-11-02 08:41:30 -07:00
Michael Yang
642128b75a
append LD_LIBRARY_PATH
2023-10-31 15:54:49 -07:00
Jeffrey Morgan
3a1ed9ff70
restore building runner with AVX on by default ( #900 )
2023-10-27 12:13:44 -07:00
Bruce MacDonald
6d283882b1
catch insufficient permissions nvidia err ( #934 )
2023-10-27 12:42:40 -04:00
Bruce MacDonald
2665f3c28e
offload 75% of available vram to improve stability ( #921 )
2023-10-26 20:49:55 -04:00
Jeffrey Morgan
b0c9cd0f3b
fix metal assertion errors
2023-10-24 00:32:36 -07:00
Jeffrey Morgan
77f61c6301
update submodule commit
2023-10-24 00:30:27 -07:00
Jeffrey Morgan
f3604534e5
update submodule commit
2023-10-23 23:59:12 -07:00
Michael Yang
0c7a00a264
bump submodules
...
pin to 9e70cc03229df19ca2d28ce23cc817198f897278 for now since
438c2ca83045a00ef244093d27e9ed41a8cb4ea9 is breaking
2023-10-23 11:17:59 -07:00
Michael Yang
36c160f1c3
Merge pull request #881 from jmorganca/mxyng/ggufv3
...
ggufv3
2023-10-23 10:50:45 -07:00
Michael Yang
c9167494cb
update default log target
2023-10-23 10:44:50 -07:00
Michael Yang
125d0a013a
ggufv3
...
ggufv3 adds support for big endianness, mainly for s390x architecture.
while that's not currently supported for ollama, the change is simple.
loosen version check to be more forward compatible. unless specified,
gguf versions other than v1 will be decoded into v2.
2023-10-23 09:35:49 -07:00
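The loosened check amounts to reading the magic and version and decoding unknown versions with the newer layout; a simplified sketch:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	// a GGUF file starts with the 4-byte magic "GGUF" and a uint32 version
	header := []byte{'G', 'G', 'U', 'F', 3, 0, 0, 0}
	r := bytes.NewReader(header)

	magic := make([]byte, 4)
	r.Read(magic)
	if string(magic) != "GGUF" {
		panic("not a gguf file")
	}

	var version uint32
	binary.Read(r, binary.LittleEndian, &version)

	// forward-compatible: only v1 needs the old layout; decode anything
	// else (v2, v3, future versions) with the v2+ layout
	if version == 1 {
		fmt.Println("decoding with v1 layout")
	} else {
		fmt.Println("decoding with v2+ layout, version", version)
	}
}
```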
Jeffrey Morgan
7ed5a39bc7
simpler check for model loading compatibility errors
2023-10-19 14:50:49 -04:00
Jeffrey Morgan
a7dad24d92
add error for falcon and starcoder vocab compatibility ( #844 )
...
add error for falcon and starcoder vocab compatibility
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-10-19 12:18:31 -04:00
Michael Yang
235e43d7f6
Merge pull request #833 from discovertomorrow/leadingspace
...
Fix Issue with Leading Whitespaces in Decoded Context
2023-10-18 13:52:48 -07:00
Arne Müller
730996e530
use TrimPrefix instead of TrimLeft
2023-10-18 22:51:30 +02:00
Arne Müller
ce6197a8e0
removed redundant strings.CutPrefix from Decode
2023-10-18 22:47:20 +02:00
Arne Müller
46b9953f32
use strings.TrimLeft to remove spaces
2023-10-18 22:41:19 +02:00
Bruce MacDonald
565648f3f7
relay CUDA errors to the client ( #825 )
2023-10-18 15:36:56 -04:00
Arne Müller
90c49bed57
moved removal of leading space into Predict
2023-10-18 20:08:26 +02:00
Arne Müller
5dc0cff459
fix whitespace removal
2023-10-18 08:15:27 +02:00
Michael Yang
08b0e04f40
Merge pull request #813 from jmorganca/mxyng/llama
...
refactor llm/llama.go
2023-10-17 14:05:58 -07:00
Michael Yang
b36b0b71f8
use cut prefix
2023-10-17 14:01:39 -07:00
Michael Yang
094df37563
remove unused struct
2023-10-17 14:01:38 -07:00
Bruce MacDonald
f3648fd206
Update llama.cpp gguf to latest ( #710 )
2023-10-17 16:55:16 -04:00
Bruce MacDonald
bd93a94abd
fix MB VRAM log output ( #824 )
2023-10-17 15:35:16 -04:00
Michael Yang
f55bdb6f10
Merge pull request #799 from deichbewohner/jsonmarshaling
...
Fix JSON Marshal Escaping for Special Characters
2023-10-17 08:46:02 -07:00
Michael Yang
2870a9bfc8
Merge pull request #812 from jmorganca/mxyng/fix-format-string
...
fix: wrong format string type
2023-10-17 08:40:49 -07:00
Arne Müller
8fa3f366ad
Removed newline trimming and used buffer directly in POST request.
2023-10-17 08:17:35 +02:00
Michael Yang
fddb303f23
fix: format string wrong type
2023-10-16 16:14:28 -07:00
Michael Yang
cb4a80b693
fix: regression unsupported metal types
...
omitting `--n-gpu-layers` means use Metal on macOS, which isn't correct,
since ollama uses `num_gpu=0` to explicitly disable the GPU for file types
that are not implemented in Metal
2023-10-16 14:37:20 -07:00
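A sketch of the fix's idea: always pass `--n-gpu-layers` explicitly so `num_gpu=0` actually disables the GPU (the helper name is illustrative, not the repo's):

```go
package main

import (
	"fmt"
	"strconv"
)

// buildArgs always passes --n-gpu-layers explicitly, because on macOS
// omitting the flag makes llama.cpp default to Metal, while num_gpu=0
// must mean "no GPU".
func buildArgs(model string, numGPU int) []string {
	return []string{
		"--model", model,
		"--n-gpu-layers", strconv.Itoa(numGPU),
	}
}

func main() {
	fmt.Println(buildArgs("model.gguf", 0))
}
```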
Arne Müller
ee94693b1a
handling unescaped json marshaling
2023-10-16 11:15:55 +02:00
Michael Yang
11d82d7b9b
update checkvram
2023-10-13 14:47:29 -07:00
Michael Yang
36fe2deebf
only check system memory on macos
2023-10-13 14:47:29 -07:00
Michael Yang
4a8931f634
check total (system + video) memory
2023-10-13 14:47:29 -07:00
Michael Yang
bd6e38fb1a
refactor memory check
2023-10-13 14:47:29 -07:00
Michael Yang
92189a5855
fix memory check
2023-10-13 14:47:29 -07:00
Michael Yang
d790bf9916
Merge pull request #783 from jmorganca/mxyng/fix-gpu-offloading
...
fix: offloading on low end GPUs
2023-10-13 14:36:44 -07:00
Michael Yang
35afac099a
do not use gpu binary when num_gpu == 0
2023-10-13 14:32:12 -07:00
Michael Yang
811c3d1900
no gpu if vram < 2GB
2023-10-13 14:32:12 -07:00
Bruce MacDonald
6fe178134d
improve api error handling ( #781 )
...
- remove newlines from llama.cpp error messages relayed to client
- check api option types and return error on wrong type
- change num layers from 95% VRAM to 92% VRAM
2023-10-13 16:57:10 -04:00
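Two of those behaviors in sketch form (function names are illustrative, not the repo's):

```go
package main

import (
	"fmt"
	"strings"
)

// relayError strips newlines from a llama.cpp error before relaying it to
// the client so it stays a single, readable message.
func relayError(raw string) string {
	return strings.TrimSpace(strings.ReplaceAll(raw, "\n", " "))
}

// numberOption type-checks an API option decoded from JSON (numbers arrive
// as float64) and returns an error on the wrong type instead of failing
// silently.
func numberOption(opts map[string]any, key string) (float64, error) {
	v, ok := opts[key]
	if !ok {
		return 0, fmt.Errorf("option %q not set", key)
	}
	n, ok := v.(float64)
	if !ok {
		return 0, fmt.Errorf("option %q must be a number, got %T", key, v)
	}
	return n, nil
}

func main() {
	fmt.Println(relayError("error: out of memory\nat layer 12\n"))
	_, err := numberOption(map[string]any{"temperature": "hot"}, "temperature")
	fmt.Println(err)
}
```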
Bruce MacDonald
56497663c8
relay model runner error message to client ( #720 )
...
* give direction to user when runner fails
* also relay errors from timeout
* increase timeout to 3 minutes
2023-10-12 11:16:37 -04:00
Michael Yang
b599946b74
add format bytes
2023-10-11 14:08:23 -07:00
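A sketch of such a byte formatter (the actual implementation may differ):

```go
package main

import "fmt"

// formatBytes renders a byte count in human-readable binary units.
func formatBytes(n uint64) string {
	switch {
	case n >= 1<<30:
		return fmt.Sprintf("%.1f GB", float64(n)/(1<<30))
	case n >= 1<<20:
		return fmt.Sprintf("%.1f MB", float64(n)/(1<<20))
	case n >= 1<<10:
		return fmt.Sprintf("%.1f KB", float64(n)/(1<<10))
	default:
		return fmt.Sprintf("%d B", n)
	}
}

func main() {
	fmt.Println(formatBytes(3 << 30)) // "3.0 GB"
}
```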