Commit graph

185 commits

Author SHA1 Message Date
Michael Yang
b9495ea162 load projectors 2023-12-05 14:36:12 -08:00
Michael Yang
409bb9674e
Merge pull request #1308 from jmorganca/mxyng/split-from
split from into one or more models
2023-12-05 14:33:03 -08:00
Michael Yang
d3479c07a1
Merge pull request #1250 from jmorganca/mxyng/create-layer
refactor layer creation
2023-12-05 14:32:52 -08:00
Bruce MacDonald
195e3d9dbd
chat api endpoint (#1392) 2023-12-05 14:57:33 -05:00
Jeffrey Morgan
00d06619a1 Revert "chat api (#991)" while context variable is fixed
This reverts commit 7a0899d62d.
2023-12-04 21:16:27 -08:00
Michael Yang
5a5dca13b2 comments 2023-12-04 16:59:23 -08:00
Michael Yang
72e7a49aa9 seek instead of copyn 2023-12-04 16:59:23 -08:00
Michael Yang
2cb0fa7d40 split from into one or more models 2023-12-04 16:59:23 -08:00
Michael Yang
b2816bca67 unnecessary ReadSeeker for DecodeGGML 2023-12-04 16:59:23 -08:00
Bruce MacDonald
7a0899d62d
chat api (#991)
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
- context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
2023-12-04 18:01:06 -05:00
Michael Yang
6deebf2489 update for qwen 2023-12-04 11:38:05 -08:00
Jeffrey Morgan
16a9006306 add back f16c instructions on intel mac 2023-11-26 15:59:49 -05:00
Jeffrey Morgan
9e4a316405 update submodule commit 2023-11-26 14:52:00 -05:00
Jing Zhang
82b9b329ff
windows CUDA support (#1262)
* Support cuda build in Windows
* Enable dynamic NumGPU allocation for Windows
2023-11-24 17:16:36 -05:00
Jongwook Choi
12e8c12d2b
Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261)
When CUDA peer access is enabled, multi-gpu inference will produce
garbage output. This is a known bug of llama.cpp (or nvidia). Until the
upstream bug is fixed, we can disable CUDA peer access temporarily
to ensure correct output.

See #961.
2023-11-24 14:05:57 -05:00
Jeffrey Morgan
d77dde126b consistent cpu instructions on macos and linux 2023-11-22 16:26:46 -05:00
Michael Yang
199941cd15 fix: gguf int type 2023-11-22 11:40:30 -08:00
Michael Yang
a00fac4ec8 update llama.cpp 2023-11-21 09:50:02 -08:00
Jeffrey Morgan
a3fcecf943 only set main_gpu if value > 0 is provided 2023-11-20 19:54:04 -05:00
Michael Yang
19b7a4d715 recent llama.cpp update added kernels for fp32, q5_0, and q5_1 2023-11-20 13:44:31 -08:00
Purinda Gunasekara
be61a81758
main-gpu argument is not getting passed to llamacpp, fixed. (#1192) 2023-11-20 10:52:52 -05:00
Jeffrey Morgan
13ba6df5ab enable cpu instructions on intel macs 2023-11-19 23:20:26 -05:00
Jeffrey Morgan
36a3bbf65f Update llm/llama.go 2023-11-18 21:25:07 -05:00
Bruce MacDonald
43a726149d fix potentially inaccurate error message 2023-11-18 21:25:07 -05:00
Jeffrey Morgan
41434a7cdc build intel mac with correct binary and compile flags 2023-11-16 22:14:51 -05:00
Jeffrey Morgan
5cba29b9d6
JSON mode: add `"format" as an api parameter (#1051)
* add `"format": "json"` as an API parameter
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-11-09 16:44:02 -08:00
Bruce MacDonald
1ae84bc2a2
skip gpu if less than 2GB VRAM are available (#1059) 2023-11-09 13:16:16 -08:00
Michael Yang
c5e1bbabda
instead of static number of parameters for each model family, get the real number from the tensors (#1022)
* parse tensor info

* refactor decoder

* return actual parameter count

* explicit rounding

* s/Human/HumanNumber/
2023-11-08 17:55:46 -08:00
Jeffrey Morgan
c44b619428 remove unused fmt.Println 2023-11-03 17:24:58 -07:00
Jeffrey Morgan
17678b7225 Restore system prompt on requests and default num_keep to 0 2023-11-03 13:25:25 -07:00
Jeffrey Morgan
2e53704685
default rope params to 0 for new models (#968) 2023-11-02 08:41:30 -07:00
Michael Yang
642128b75a append LD_LIBRARY_PATH 2023-10-31 15:54:49 -07:00
Jeffrey Morgan
3a1ed9ff70
restore building runner with AVX on by default (#900) 2023-10-27 12:13:44 -07:00
Bruce MacDonald
6d283882b1
catch insufficient permissions nvidia err (#934) 2023-10-27 12:42:40 -04:00
Bruce MacDonald
2665f3c28e
offload 75% of available vram to improve stability (#921) 2023-10-26 20:49:55 -04:00
Jeffrey Morgan
b0c9cd0f3b fix metal assertion errors 2023-10-24 00:32:36 -07:00
Jeffrey Morgan
77f61c6301 update submodule commit 2023-10-24 00:30:27 -07:00
Jeffrey Morgan
f3604534e5 update submodule commit 2023-10-23 23:59:12 -07:00
Michael Yang
0c7a00a264 bump submodules
pin to 9e70cc03229df19ca2d28ce23cc817198f897278 for now since
438c2ca83045a00ef244093d27e9ed41a8cb4ea9 is breaking
2023-10-23 11:17:59 -07:00
Michael Yang
36c160f1c3
Merge pull request #881 from jmorganca/mxyng/ggufv3
ggufv3
2023-10-23 10:50:45 -07:00
Michael Yang
c9167494cb update default log target 2023-10-23 10:44:50 -07:00
Michael Yang
125d0a013a ggufv3
ggufv3 adds support for big endianness, mainly for s390x architecture.
while that's not currently supported for ollama, the change is simple.

loosen version check to be more forward compatible. unless specified,
gguf versions other v1 will be decoded into v2.
2023-10-23 09:35:49 -07:00
Jeffrey Morgan
7ed5a39bc7 simpler check for model loading compatibility errors 2023-10-19 14:50:49 -04:00
Jeffrey Morgan
a7dad24d92
add error for falcon and starcoder vocab compatibility (#844)
add error for falcon and starcoder vocab compatibility
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2023-10-19 12:18:31 -04:00
Michael Yang
235e43d7f6
Merge pull request #833 from discovertomorrow/leadingspace
Fix Issue with Leading Whitespaces in Decoded Context
2023-10-18 13:52:48 -07:00
Arne Müller
730996e530 use TrimPrefix instead of TrimLeft 2023-10-18 22:51:30 +02:00
Arne Müller
ce6197a8e0 removed redundant strings.CutPrefix from Decode 2023-10-18 22:47:20 +02:00
Arne Müller
46b9953f32 use strings.TrimLeft to remove spaces 2023-10-18 22:41:19 +02:00
Bruce MacDonald
565648f3f7
relay CUDA errors to the client (#825) 2023-10-18 15:36:56 -04:00
Arne Müller
90c49bed57 moved removal of leading space into Predict 2023-10-18 20:08:26 +02:00
Arne Müller
5dc0cff459 fix whitespace removal 2023-10-18 08:15:27 +02:00
Michael Yang
08b0e04f40
Merge pull request #813 from jmorganca/mxyng/llama
refactor llm/llama.go
2023-10-17 14:05:58 -07:00
Michael Yang
b36b0b71f8 use cut prefix 2023-10-17 14:01:39 -07:00
Michael Yang
094df37563 remove unused struct 2023-10-17 14:01:38 -07:00
Bruce MacDonald
f3648fd206
Update llama.cpp gguf to latest (#710) 2023-10-17 16:55:16 -04:00
Bruce MacDonald
bd93a94abd
fix MB VRAM log output (#824) 2023-10-17 15:35:16 -04:00
Michael Yang
f55bdb6f10
Merge pull request #799 from deichbewohner/jsonmarshaling
Fix JSON Marshal Escaping for Special Characters
2023-10-17 08:46:02 -07:00
Michael Yang
2870a9bfc8
Merge pull request #812 from jmorganca/mxyng/fix-format-string
fix: wrong format string type
2023-10-17 08:40:49 -07:00
Arne Müller
8fa3f366ad Removed newline trimming and used buffer directly in POST request. 2023-10-17 08:17:35 +02:00
Michael Yang
fddb303f23 fix: format string wrong type 2023-10-16 16:14:28 -07:00
Michael Yang
cb4a80b693 fix: regression unsupported metal types
omitting `--n-gpu-layers` means use metal on macos which isn't correct
since ollama uses `num_gpu=0` to explicitly disable gpu for file types
that are not implemented in metal
2023-10-16 14:37:20 -07:00
Arne Müller
ee94693b1a handling unescaped json marshaling 2023-10-16 11:15:55 +02:00
Michael Yang
11d82d7b9b update checkvram 2023-10-13 14:47:29 -07:00
Michael Yang
36fe2deebf only check system memory on macos 2023-10-13 14:47:29 -07:00
Michael Yang
4a8931f634 check total (system + video) memory 2023-10-13 14:47:29 -07:00
Michael Yang
bd6e38fb1a refactor memory check 2023-10-13 14:47:29 -07:00
Michael Yang
92189a5855 fix memory check 2023-10-13 14:47:29 -07:00
Michael Yang
d790bf9916
Merge pull request #783 from jmorganca/mxyng/fix-gpu-offloading
fix: offloading on low end GPUs
2023-10-13 14:36:44 -07:00
Michael Yang
35afac099a do not use gpu binary when num_gpu == 0 2023-10-13 14:32:12 -07:00
Michael Yang
811c3d1900 no gpu if vram < 2GB 2023-10-13 14:32:12 -07:00
Bruce MacDonald
6fe178134d
improve api error handling (#781)
- remove new lines from llama.cpp error messages relayed to client
- check api option types and return error on wrong type
- change num layers from 95% VRAM to 92% VRAM
2023-10-13 16:57:10 -04:00
Bruce MacDonald
56497663c8
relay model runner error message to client (#720)
* give direction to user when runner fails
* also relay errors from timeout
* increase timeout to 3 minutes
2023-10-12 11:16:37 -04:00
Michael Yang
b599946b74 add format bytes 2023-10-11 14:08:23 -07:00
Bruce MacDonald
77295f716e
prevent waiting on exited command (#752)
* prevent waiting on exited command
* close llama runner once
2023-10-11 12:32:13 -04:00
Bruce MacDonald
f2ba1311aa
improve vram safety with 5% vram memory buffer (#724)
* check free memory not total
* wait for subprocess to exit
2023-10-10 16:16:09 -04:00
Jeffrey Morgan
ab0668293c llm: fix build on amd64 2023-10-06 14:39:54 -07:00
Bruce MacDonald
5d22319a2c
rename server subprocess (#700)
- this makes it easier to see that the subprocess is associated with ollama
2023-10-06 10:15:42 -04:00
Bruce MacDonald
d06bc0cb6e
enable q8, q5, 5_1, and f32 for linux gpu (#699) 2023-10-05 12:53:47 -04:00
Bruce MacDonald
9e2de1bd2c
increase streaming buffer size (#692) 2023-10-04 14:09:00 -04:00
Michael Yang
c02c0cd483 starcoder 2023-10-02 19:56:51 -07:00
Bruce MacDonald
b1f7123301
clean up num_gpu calculation code (#673) 2023-10-02 14:53:42 -04:00
Bruce MacDonald
1fbf3585d6
Relay default values to llama runner (#672)
* include seed in params for llama.cpp server and remove empty filter for temp

* relay default predict options to llama.cpp

- reorganize options to match predict request for readability

* omit empty stop

---------

Co-authored-by: hallh <hallh@users.noreply.github.com>
2023-10-02 14:53:16 -04:00
Bruce MacDonald
9771b1ec51
windows runner fixes (#637) 2023-09-29 11:47:55 -04:00
Michael Yang
f40b3de758 use int64 consistently 2023-09-28 11:07:24 -07:00
Bruce MacDonald
86279f4ae3
unbound max num gpu layers (#591)
---------

Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-25 18:36:46 -04:00
Michael Yang
058d0cd04b silence warm up log 2023-09-21 14:53:33 -07:00
Michael Yang
ee1c994d15
update submodule (#567) 2023-09-21 16:22:23 -04:00
Bruce MacDonald
4cba75efc5
remove tmp directories created by previous servers (#559)
* remove tmp directories created by previous servers

* clean up on server stop

* Update routes.go

* Update server/routes.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* create top-level temp ollama dir

* check file exists before creating

---------

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-21 20:38:49 +01:00
Michael Yang
a9ed7cc6aa rename generate.go 2023-09-20 14:42:17 -07:00
Michael Yang
6c6a31a1e8 embed libraries using cmake 2023-09-20 14:41:57 -07:00
Bruce MacDonald
fc6ec356fc remove libcuda.so 2023-09-20 20:36:14 +01:00
Bruce MacDonald
1255bc9b45 only package 11.8 runner 2023-09-20 20:00:41 +01:00
Bruce MacDonald
b9bb5ca288 use cuda_version 2023-09-20 17:58:16 +01:00
Bruce MacDonald
4e8be787c7 pack in cuda libs 2023-09-20 17:40:42 +01:00
Bruce MacDonald
66003e1d05
subprocess improvements (#524)
* subprocess improvements

- increase start-up timeout
- when runner fails to start fail rather than timing out
- try runners in order rather than choosing 1 runner
- embed metal runner in metal dir rather than gpu
- refactor logging and error messages

* Update llama.go

* Update llama.go

* simplify by using glob
2023-09-18 15:16:32 -04:00
Bruce MacDonald
2540c9181c
support for packaging in multiple cuda runners (#509)
* enable packaging multiple cuda versions
* use nvcc cuda version if available

---------

Co-authored-by: Michael Yang <mxyng@pm.me>
2023-09-14 15:08:13 -04:00
Michael Yang
d028853879 fix: add falcon.go 2023-09-13 14:47:37 -07:00
Michael Yang
949553db23
Merge pull request #519 from jmorganca/mxyng/decode
Mxyng/decode
2023-09-13 12:43:57 -07:00
Michael Yang
0c5a454361 fix model type for 70b 2023-09-12 15:12:59 -07:00
Bruce MacDonald
f59c4d03f7
fix ggml arm64 cuda build (#520) 2023-09-12 17:06:48 -04:00