jmorganca
ddf5c09a9b
use matrix multiplcation kernels in more cases
2024-04-25 13:58:54 -07:00
Roy Yang
5f73c08729
Remove trailing spaces ( #3889 )
2024-04-25 14:32:26 -04:00
Daniel Hiltgen
6e76348df7
Merge pull request #3834 from dhiltgen/not_found_in_path
...
Report errors on server lookup instead of path lookup failure
2024-04-24 10:50:48 -07:00
Patrick Devine
14476d48cc
fixes for gguf ( #3863 )
2024-04-23 20:57:20 -07:00
Daniel Hiltgen
5445aaa94e
Add back memory escape valve
...
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround. Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
Daniel Hiltgen
058f6cd2cc
Move nested payloads to installer and zip file on windows
...
Now that the llm runner is an executable and not just a dll, more users are facing
problems with security policy configurations on windows that prevent users
writing to directories and then executing binaries from the same location.
This change removes payloads from the main executable on windows and shifts them
over to be packaged in the installer and discovered based on the executables location.
This also adds a new zip file for people who want to "roll their own" installation model.
2024-04-23 16:14:47 -07:00
Daniel Hiltgen
58888a74bc
Detect and recover if runner removed
...
Tmp cleaners can nuke the file out from underneath us. This detects the missing
runner, and re-initializes the payloads.
2024-04-23 10:05:26 -07:00
Daniel Hiltgen
cc5a71e0e3
Merge pull request #3709 from remy415/custom-gpu-defs
...
Adds support for customizing GPU build flags in llama.cpp
2024-04-23 09:28:34 -07:00
Michael Yang
e83bcf7f9a
Merge pull request #3836 from ollama/mxyng/mixtral
...
fix: mixtral graph
2024-04-23 09:15:10 -07:00
Daniel Hiltgen
34b9db5afc
Request and model concurrency
...
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
Daniel Hiltgen
8711d03df7
Report errors on server lookup instead of path lookup failure
2024-04-22 19:08:47 -07:00
Michael Yang
435cc866a3
fix: mixtral graph
2024-04-22 17:19:44 -07:00
Daniel Hiltgen
aa72281eae
Trim spaces and quotes from llm lib override
2024-04-22 17:11:14 -07:00
Jeremy
9c0db4cc83
Update gen_windows.ps1
...
Fixed improper env references
2024-04-21 16:13:41 -04:00
Cheng
62be2050dd
chore: use errors.New to replace fmt.Errorf will much better ( #3789 )
2024-04-20 22:11:06 -04:00
Jeremy
6f18297b3a
Update gen_windows.ps1
...
Forgot a " on the write-host
2024-04-18 19:47:44 -04:00
Jeremy
15016413de
Update gen_windows.ps1
...
Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS to customize GPU builds on Windows
2024-04-18 19:27:16 -04:00
Jeremy
440b7190ed
Update gen_linux.sh
...
Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS instead of OLLAMA_CUSTOM_GPU_DEFS
2024-04-18 19:18:10 -04:00
Jeremy
3934c15895
Merge branch 'ollama:main' into custom-gpu-defs
2024-04-18 09:55:10 -04:00
Jeremy
fd048f1367
Merge branch 'ollama:main' into arm64static
2024-04-18 09:55:04 -04:00
Michael Yang
8645076a71
Merge pull request #3712 from ollama/mxyng/mem
...
add stablelm graph calculation
2024-04-17 15:57:51 -07:00
Michael Yang
05e9424824
Merge pull request #3664 from ollama/mxyng/fix-padding-2
...
fix padding to only return padding
2024-04-17 15:57:40 -07:00
Michael Yang
3cf483fe48
add stablelm graph calculation
2024-04-17 13:57:19 -07:00
Jeremy
52f5370c48
add support for custom gpu build flags for llama.cpp
2024-04-17 16:00:48 -04:00
Jeremy
7c000ec3ed
adds support for OLLAMA_CUSTOM_GPU_DEFS to customize GPU build flags
2024-04-17 15:21:05 -04:00
Jeremy
ea4c284a48
Merge branch 'ollama:main' into arm64static
2024-04-17 15:11:38 -04:00
Jeremy
8aec92fa6d
rearranged conditional logic for static build, dockerfile updated
2024-04-17 14:43:28 -04:00
Michael Yang
a8b9b930b4
account for all non-repeating layers
2024-04-17 11:21:21 -07:00
Jeremy
70261b9bb6
move static build to its own flag
2024-04-17 13:04:28 -04:00
Michael Yang
e74163af4c
fix padding to only return padding
2024-04-16 15:43:26 -07:00
Michael Yang
26df674785
scale graph based on gpu count
2024-04-16 14:44:13 -07:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path ( #3681 )
...
* parse wide argv characters on windows
* cleanup
* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Michael Yang
41a272de9f
darwin: no partial offloading if required memory greater than system
2024-04-16 11:22:38 -07:00
Jeffrey Morgan
f335722275
update llama.cpp submodule to 7593639
( #3665 )
2024-04-15 23:04:43 -04:00
Michael Yang
6d53b67c2c
Merge pull request #3663 from ollama/mxyng/fix-padding
2024-04-15 17:44:54 -07:00
Michael Yang
969238b19e
fix padding in decode
...
TODO: update padding() to _only_ returning the padding
2024-04-15 17:27:06 -07:00
Patrick Devine
9f8691c6c8
Add llama2 / torch models for ollama create
( #3607 )
2024-04-15 11:26:42 -07:00
Jeffrey Morgan
a0b8a32eb4
Terminate subprocess if receiving SIGINT
or SIGTERM
signals while model is loading ( #3653 )
...
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading
* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
Jeffrey Morgan
309aef7fee
update llama.cpp submodule to 4bd0f93
( #3627 )
2024-04-13 10:43:02 -07:00
Michael Yang
3397eff0cd
mixtral mem
2024-04-11 11:10:41 -07:00
Michael Yang
7e33a017c0
partial offloading
2024-04-10 11:37:20 -07:00
Michael Yang
8b2c10061c
refactor tensor query
2024-04-10 11:37:20 -07:00
Daniel Hiltgen
c5ff443b9f
Handle very slow model loads
...
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
Blake Mizerany
1524f323a3
Revert "build.go: introduce a friendlier way to build Ollama ( #3548 )" ( #3564 )
2024-04-09 15:57:45 -07:00
Blake Mizerany
fccf3eecaa
build.go: introduce a friendlier way to build Ollama ( #3548 )
...
This commit introduces a more friendly way to build Ollama dependencies
and the binary without abusing `go generate` and removing the
unnecessary extra steps it brings with it.
This script also provides nicer feedback to the user about what is
happening during the build process.
At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).
2024-04-09 14:18:47 -07:00
Michael Yang
c77d45d836
Merge pull request #3506 from ollama/mxyng/quantize-redux
...
cgo quantize
2024-04-09 12:32:53 -07:00
Jeffrey Morgan
5ec12cec6c
update llama.cpp submodule to 1b67731
( #3561 )
2024-04-09 15:10:17 -04:00
Michael Yang
9502e5661f
cgo quantize
2024-04-08 15:31:08 -07:00
Jeffrey Morgan
63efa075a0
update generate scripts with new LLAMA_CUDA
variable, set HIP_PLATFORM
to avoid compiler errors ( #3528 )
2024-04-07 19:29:51 -04:00
Michael Yang
be517e491c
no rope parameters
2024-04-05 18:05:27 -07:00