Sam
e15307fdf4
feat: add support for flash_attn ( #4120 )
...
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: enable flash attention if supported
* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Jeffrey Morgan
583c1f472c
update llama.cpp submodule to `614d3b9` ( #4414 )
2024-05-16 13:53:09 -07:00
Daniel Hiltgen
c48c1d7c46
Port cuda/rocm skip build vars to linux
...
Windows already implements these, carry over to linux.
2024-05-15 15:56:43 -07:00
Patrick Devine
d1692fd3e0
fix the cpu estimatedTotal memory + get the expiry time for loading models ( #4461 )
2024-05-15 15:43:16 -07:00
Daniel Hiltgen
853ae490e1
Sanitize the env var debug log
...
Only dump env vars we care about in the logs
2024-05-15 14:42:57 -07:00
Michael Yang
0e331c7168
Merge pull request #4328 from ollama/mxyng/mem
...
count memory up to NumGPU if set by user
2024-05-14 13:47:44 -07:00
Patrick Devine
6845988807
Ollama `ps` command for showing currently loaded models ( #4327 )
2024-05-13 17:17:36 -07:00
Michael Yang
1d359e737e
typo
2024-05-13 14:18:34 -07:00
Michael Yang
50b9056e09
count memory up to NumGPU
2024-05-13 14:13:10 -07:00
jmorganca
92ca2cca95
Revert "only forward some env vars"
...
This reverts commit ce3b212d12.
2024-05-10 22:53:21 -07:00
Daniel Hiltgen
c4014e73a2
Fall back to CPU runner with zero layers
2024-05-10 15:09:48 -07:00
Michael Yang
1eb382da5a
add phi2 mem
2024-05-10 12:13:28 -07:00
Jeffrey Morgan
bb6fd02298
Don't clamp ctx size in `PredictServerFit` ( #4317 )
...
* dont clamp ctx size in `PredictServerFit`
* minimum 4 context
* remove context warning
2024-05-10 10:17:12 -07:00
Michael Yang
cf442cd57e
fix typo
2024-05-09 16:23:37 -07:00
Michael Yang
ce3b212d12
only forward some env vars
2024-05-09 15:16:09 -07:00
Michael Yang
58876091f7
log clean up
2024-05-09 14:55:36 -07:00
Daniel Hiltgen
d0425f26cf
Merge pull request #4294 from dhiltgen/harden_subprocess_reaping
...
Harden subprocess reaping
2024-05-09 14:02:16 -07:00
Bruce MacDonald
cfa84b8470
add done_reason to the api ( #4235 )
2024-05-09 13:30:14 -07:00
Daniel Hiltgen
84ac7ce139
Refine subprocess reaping
2024-05-09 11:21:31 -07:00
Daniel Hiltgen
920a4b0794
Merge remote-tracking branch 'upstream/main' into pr3702
2024-05-08 16:44:35 -07:00
Daniel Hiltgen
ee49844d09
Merge pull request #4153 from dhiltgen/gpu_verbose_response
...
Add GPU usage
2024-05-08 16:39:11 -07:00
Daniel Hiltgen
8a516ac862
Merge pull request #4241 from dhiltgen/fix_tmp_override
...
Detect noexec and report a better error
2024-05-08 15:34:22 -07:00
Daniel Hiltgen
bee2f4a3b0
Record GPU usage information
...
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Michael Yang
eeb695261f
skip if same quantization
2024-05-07 17:44:19 -07:00
Daniel Hiltgen
72700279e2
Detect noexec and report a better error
...
This will bubble up a much more informative error message if noexec
is preventing us from running the subprocess
2024-05-07 16:46:15 -07:00
Michael Yang
1e0a669f75
Merge pull request #3682 from ollama/mxyng/quantize-all-the-things
...
quantize any fp16/fp32 model
2024-05-07 15:20:49 -07:00
Michael Yang
4736391bfb
llm: add minimum based on layer size
2024-05-06 17:04:19 -07:00
Michael Yang
01811c176a
comments
2024-05-06 15:24:01 -07:00
Michael Yang
9685c34509
quantize any fp16/fp32 model
...
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
2024-05-06 15:24:01 -07:00
Daniel Hiltgen
380378cc80
Use our libraries first
...
Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly
2024-05-06 14:23:29 -07:00
Jeffrey Morgan
ed740a2504
Fix `no slots available` error with concurrent requests ( #4160 )
2024-05-06 14:22:53 -07:00
Jeffrey Morgan
1b0e6c9c0e
Fix llava models not working after first request ( #4164 )
...
* fix llava models not working after first request
* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen
f56aa20014
Centralize server config handling
...
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
2024-05-05 16:49:50 -07:00
Michael Yang
44869c59d6
omit prompt and generate settings from final response
2024-05-03 17:00:02 -07:00
Mark Ward
321d57e1a0
Removing go routine calling .wait from load.
2024-05-01 18:51:10 +00:00
Mark Ward
ba26c7aa00
it will always return an error due to Kill() discarding Wait() errors
2024-05-01 18:51:10 +00:00
Mark Ward
63c763685f
log when waiting for the process to stop, to help debug when other tasks execute during this wait.
...
expire timer clear the timer reference because it will not be reused.
close will clean up expireTimer if calling code has not already done this.
2024-05-01 18:51:10 +00:00
Mark Ward
948114e3e3
fix sched to wait for the runner to terminate to ensure following vram check will be more accurate
2024-05-01 18:51:10 +00:00
Jeffrey Morgan
f0c454ab57
gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead ( #4068 )
2024-05-01 11:46:03 -04:00
jmorganca
fcf4d60eee
llm: add back check for empty token cache
2024-04-30 17:38:44 -04:00
jmorganca
e33d5c2dbc
update llama.cpp commit to 952d03d
2024-04-30 17:31:20 -04:00
Jeffrey Morgan
18d9a7e1f1
update llama.cpp submodule to `f364eb6` ( #4060 )
2024-04-30 17:25:39 -04:00
Daniel Hiltgen
23d23409a0
Update llama.cpp ( #4036 )
...
* Bump llama.cpp to b2761
* Adjust types for bump
2024-04-29 23:18:48 -04:00
Jeffrey Morgan
7aa08a77ca
llm: dont cap context window limit to training context window ( #3988 )
2024-04-29 10:07:30 -04:00
Hernan Martinez
8a65717f55
Do not build AVX runners on ARM64
2024-04-26 23:55:32 -06:00
Hernan Martinez
b438d485f1
Use architecture specific folders in the generate script
2024-04-26 23:34:12 -06:00
Hernan Martinez
86e67fc4a9
Add import declaration for windows,arm64 to llm.go
2024-04-26 23:23:53 -06:00
Daniel Hiltgen
e4859c4563
Fine grain control over windows generate steps
...
This will speed up CI which already tries to only build static for unit tests
2024-04-26 15:49:46 -07:00
Daniel Hiltgen
0b5c589ca2
Merge pull request #3966 from dhiltgen/bump
...
Fix target in gen_windows.ps1
2024-04-26 15:36:53 -07:00
Michael Yang
65fadddc85
Merge pull request #3964 from ollama/mxyng/weights
...
fix gemma, command-r layer weights
2024-04-26 15:23:33 -07:00
Daniel Hiltgen
ed5fb088c4
Fix target in gen_windows.ps1
2024-04-26 15:10:42 -07:00
Michael Yang
f81f308118
fix gemma, command-r layer weights
2024-04-26 15:00:55 -07:00
Jeffrey Morgan
bb31def011
return code `499` when user cancels request while a model is loading ( #3955 )
2024-04-26 17:38:29 -04:00
Daniel Hiltgen
5c0c2d1d09
Merge pull request #3954 from dhiltgen/ci_fixes
...
Put back non-avx CPU build for windows
2024-04-26 13:09:03 -07:00
Daniel Hiltgen
421c878a2d
Put back non-avx CPU build for windows
2024-04-26 12:44:07 -07:00
Daniel Hiltgen
85801317d1
Fix clip log import
2024-04-26 09:43:46 -07:00
Daniel Hiltgen
2ed0d65948
Bump llama.cpp to b2737
2024-04-26 09:43:28 -07:00
Daniel Hiltgen
8671fdeda6
Refactor windows generate for more modular usage
2024-04-26 08:35:50 -07:00
Daniel Hiltgen
8feb97dc0d
Move cuda/rocm dependency gathering into generate script
...
This will make it simpler for CI to accumulate artifacts from prior steps
2024-04-25 22:38:44 -07:00
Michael Yang
de4ded68b0
Merge pull request #3923 from ollama/mxyng/mem
...
only count output tensors
2024-04-25 16:34:17 -07:00
Daniel Hiltgen
9b5a3c5991
Merge pull request #3914 from dhiltgen/mac_perf
...
Improve mac parallel performance
2024-04-25 16:28:31 -07:00
Jeffrey Morgan
993cf8bf55
llm: limit generation to 10x context size to avoid run on generations ( #3918 )
...
* llm: limit generation to 10x context size to avoid run on generations
* add comment
* simplify condition statement
2024-04-25 19:02:30 -04:00
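The limit described in the commit above can be pictured as a simple clamp. The Go sketch below is illustrative only: `capPredict` is a hypothetical helper, not code from the repository, and treating a negative request as "unlimited, therefore capped" is an assumption.

```go
package main

import "fmt"

// capPredict is a hypothetical helper. It clamps the requested number of
// generated tokens to 10x the context size. A negative request is assumed
// to mean "no limit" and is therefore capped as well.
func capPredict(numPredict, numCtx int) int {
	limit := 10 * numCtx
	if numPredict < 0 || numPredict > limit {
		return limit
	}
	return numPredict
}

func main() {
	fmt.Println(capPredict(-1, 2048))  // 20480
	fmt.Println(capPredict(128, 2048)) // 128
}
```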
Michael Yang
7bb7cb8a60
only count output tensors
2024-04-25 15:24:08 -07:00
jmorganca
ddf5c09a9b
use matrix multiplication kernels in more cases
2024-04-25 13:58:54 -07:00
Roy Yang
5f73c08729
Remove trailing spaces ( #3889 )
2024-04-25 14:32:26 -04:00
Daniel Hiltgen
6e76348df7
Merge pull request #3834 from dhiltgen/not_found_in_path
...
Report errors on server lookup instead of path lookup failure
2024-04-24 10:50:48 -07:00
Patrick Devine
14476d48cc
fixes for gguf ( #3863 )
2024-04-23 20:57:20 -07:00
Daniel Hiltgen
5445aaa94e
Add back memory escape valve
...
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround. Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
Daniel Hiltgen
058f6cd2cc
Move nested payloads to installer and zip file on windows
...
Now that the llm runner is an executable and not just a dll, more users are facing
problems with security policy configurations on windows that prevent users
writing to directories and then executing binaries from the same location.
This change removes payloads from the main executable on windows and shifts them
over to be packaged in the installer and discovered based on the executable's location.
This also adds a new zip file for people who want to "roll their own" installation model.
2024-04-23 16:14:47 -07:00
Daniel Hiltgen
58888a74bc
Detect and recover if runner removed
...
Tmp cleaners can nuke the file out from underneath us. This detects the missing
runner, and re-initializes the payloads.
2024-04-23 10:05:26 -07:00
Daniel Hiltgen
cc5a71e0e3
Merge pull request #3709 from remy415/custom-gpu-defs
...
Adds support for customizing GPU build flags in llama.cpp
2024-04-23 09:28:34 -07:00
Michael Yang
e83bcf7f9a
Merge pull request #3836 from ollama/mxyng/mixtral
...
fix: mixtral graph
2024-04-23 09:15:10 -07:00
Daniel Hiltgen
34b9db5afc
Request and model concurrency
...
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
Daniel Hiltgen
8711d03df7
Report errors on server lookup instead of path lookup failure
2024-04-22 19:08:47 -07:00
Michael Yang
435cc866a3
fix: mixtral graph
2024-04-22 17:19:44 -07:00
Daniel Hiltgen
aa72281eae
Trim spaces and quotes from llm lib override
2024-04-22 17:11:14 -07:00
Jeremy
9c0db4cc83
Update gen_windows.ps1
...
Fixed improper env references
2024-04-21 16:13:41 -04:00
Cheng
62be2050dd
chore: use errors.New instead of fmt.Errorf where no formatting is needed ( #3789 )
2024-04-20 22:11:06 -04:00
Jeremy
6f18297b3a
Update gen_windows.ps1
...
Forgot a " on the write-host
2024-04-18 19:47:44 -04:00
Jeremy
15016413de
Update gen_windows.ps1
...
Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS to customize GPU builds on Windows
2024-04-18 19:27:16 -04:00
Jeremy
440b7190ed
Update gen_linux.sh
...
Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS instead of OLLAMA_CUSTOM_GPU_DEFS
2024-04-18 19:18:10 -04:00
ManniX-ITA
c496967e56
Merge branch 'ollama:main' into mannix-server
2024-04-18 18:45:15 +02:00
Jeremy
3934c15895
Merge branch 'ollama:main' into custom-gpu-defs
2024-04-18 09:55:10 -04:00
Jeremy
fd048f1367
Merge branch 'ollama:main' into arm64static
2024-04-18 09:55:04 -04:00
Michael Yang
8645076a71
Merge pull request #3712 from ollama/mxyng/mem
...
add stablelm graph calculation
2024-04-17 15:57:51 -07:00
Michael Yang
05e9424824
Merge pull request #3664 from ollama/mxyng/fix-padding-2
...
fix padding to only return padding
2024-04-17 15:57:40 -07:00
Michael Yang
3cf483fe48
add stablelm graph calculation
2024-04-17 13:57:19 -07:00
Jeremy
52f5370c48
add support for custom gpu build flags for llama.cpp
2024-04-17 16:00:48 -04:00
Jeremy
7c000ec3ed
adds support for OLLAMA_CUSTOM_GPU_DEFS to customize GPU build flags
2024-04-17 15:21:05 -04:00
Jeremy
ea4c284a48
Merge branch 'ollama:main' into arm64static
2024-04-17 15:11:38 -04:00
Jeremy
8aec92fa6d
rearranged conditional logic for static build, dockerfile updated
2024-04-17 14:43:28 -04:00
Michael Yang
a8b9b930b4
account for all non-repeating layers
2024-04-17 11:21:21 -07:00
Jeremy
70261b9bb6
move static build to its own flag
2024-04-17 13:04:28 -04:00
ManniX-ITA
c942e4a07b
Fixed startup sequence to report model loading
2024-04-17 17:40:32 +02:00
ManniX-ITA
bd54b08261
Streamlined WaitUntilRunning
2024-04-17 17:39:52 +02:00
Michael Yang
e74163af4c
fix padding to only return padding
2024-04-16 15:43:26 -07:00
Michael Yang
26df674785
scale graph based on gpu count
2024-04-16 14:44:13 -07:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path ( #3681 )
...
* parse wide argv characters on windows
* cleanup
* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Michael Yang
41a272de9f
darwin: no partial offloading if required memory greater than system
2024-04-16 11:22:38 -07:00
Jeffrey Morgan
f335722275
update llama.cpp submodule to `7593639` ( #3665 )
2024-04-15 23:04:43 -04:00
Michael Yang
6d53b67c2c
Merge pull request #3663 from ollama/mxyng/fix-padding
2024-04-15 17:44:54 -07:00
Michael Yang
969238b19e
fix padding in decode
...
TODO: update padding() to _only_ return the padding
2024-04-15 17:27:06 -07:00
Patrick Devine
9f8691c6c8
Add llama2 / torch models for `ollama create` ( #3607 )
2024-04-15 11:26:42 -07:00
Jeffrey Morgan
a0b8a32eb4
Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading ( #3653 )
...
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading
* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
Jeffrey Morgan
309aef7fee
update llama.cpp submodule to `4bd0f93` ( #3627 )
2024-04-13 10:43:02 -07:00
Michael Yang
3397eff0cd
mixtral mem
2024-04-11 11:10:41 -07:00
Michael Yang
7e33a017c0
partial offloading
2024-04-10 11:37:20 -07:00
Michael Yang
8b2c10061c
refactor tensor query
2024-04-10 11:37:20 -07:00
Daniel Hiltgen
c5ff443b9f
Handle very slow model loads
...
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
Blake Mizerany
1524f323a3
Revert "build.go: introduce a friendlier way to build Ollama ( #3548 )" ( #3564 )
2024-04-09 15:57:45 -07:00
Blake Mizerany
fccf3eecaa
build.go: introduce a friendlier way to build Ollama ( #3548 )
...
This commit introduces a more friendly way to build Ollama dependencies
and the binary without abusing `go generate` and removing the
unnecessary extra steps it brings with it.
This script also provides nicer feedback to the user about what is
happening during the build process.
At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).
2024-04-09 14:18:47 -07:00
Michael Yang
c77d45d836
Merge pull request #3506 from ollama/mxyng/quantize-redux
...
cgo quantize
2024-04-09 12:32:53 -07:00
Jeffrey Morgan
5ec12cec6c
update llama.cpp submodule to `1b67731` ( #3561 )
2024-04-09 15:10:17 -04:00
Michael Yang
9502e5661f
cgo quantize
2024-04-08 15:31:08 -07:00
Jeffrey Morgan
63efa075a0
update generate scripts with new `LLAMA_CUDA` variable, set `HIP_PLATFORM` to avoid compiler errors ( #3528 )
2024-04-07 19:29:51 -04:00
Michael Yang
be517e491c
no rope parameters
2024-04-05 18:05:27 -07:00
Michael Yang
fc8e108642
Merge pull request #3496 from ollama/mxyng/cmd-r-graph
...
add command-r graph estimate
2024-04-05 12:26:21 -07:00
Daniel Hiltgen
dfe330fa1c
Merge pull request #3488 from mofanke/fix-windows-dll-compress
...
fix dll compress in windows building
2024-04-04 16:12:13 -07:00
Michael Yang
01f77ae25d
add command-r graph estimate
2024-04-04 14:07:24 -07:00
Daniel Hiltgen
36bd967722
Fail fast if mingw missing on windows
2024-04-04 09:51:26 -07:00
mofanke
4de0126719
fix dll compress in windows building
2024-04-04 21:27:33 +08:00
Daniel Hiltgen
e4a7e5b2ca
Fix CI release glitches
...
The subprocess change moved the build directory
arm64 builds weren't setting cross-compilation flags when building on x86
2024-04-03 16:41:40 -07:00
Michael Yang
12e923e158
update graph size estimate
2024-04-03 13:34:12 -07:00
Jeffrey Morgan
cd135317d2
Fix macOS builds on older SDKs ( #3467 )
2024-04-03 10:45:54 -07:00
Michael Yang
4f895d633f
Merge pull request #3466 from ollama/mxyng/head-kv
...
default head_kv to 1
2024-04-03 10:41:00 -07:00
Daniel Hiltgen
464d817824
Merge pull request #3464 from dhiltgen/subprocess
...
Fix numgpu opt miscomparison
2024-04-02 20:10:17 -07:00
Daniel Hiltgen
6589eb8a8c
Revert options as a ref in the server
2024-04-02 16:44:10 -07:00
Michael Yang
90f071c658
default head_kv to 1
2024-04-02 16:37:59 -07:00
Michael Yang
80163ebcb5
fix metal gpu
2024-04-02 16:06:45 -07:00
Daniel Hiltgen
0035e31af8
Bump to b2581
2024-04-02 11:53:07 -07:00
Daniel Hiltgen
0a0e9f3e0f
Apply 01-cache.diff
2024-04-01 16:48:18 -07:00
Daniel Hiltgen
58d95cc9bd
Switch back to subprocessing for llama.cpp
...
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
Michael Yang
91b3e4d282
update memory calculations
...
count each layer independently when deciding gpu offloading
2024-04-01 13:16:32 -07:00
Michael Yang
d338d70492
refactor model parsing
2024-04-01 13:16:15 -07:00
Patrick Devine
5a5efee46b
Add gemma safetensors conversion ( #3250 )
...
Co-authored-by: Michael Yang <mxyng@pm.me>
2024-03-28 18:54:01 -07:00
Jeffrey Morgan
f5ca7f8c8e
add license in file header for vendored llama.cpp code ( #3351 )
2024-03-26 16:23:23 -04:00
Jeffrey Morgan
856b8ec131
remove need for `$VSINSTALLDIR` since build will fail if `ninja` cannot be found ( #3350 )
2024-03-26 16:23:16 -04:00
Patrick Devine
1b272d5bcd
change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` ( #3347 )
2024-03-26 13:04:17 -07:00
Daniel Hiltgen
8091ef2eeb
Bump llama.cpp to b2527
2024-03-25 13:47:44 -07:00
Daniel Hiltgen
560be5e0b6
Merge pull request #3308 from dhiltgen/bump_more
...
Bump llama.cpp to b2510
2024-03-25 12:56:12 -07:00
Jeremy
dfc6721b20
add support for libcudart.so for CUDA devices (adds Jetson support)
2024-03-25 11:07:44 -04:00
Blake Mizerany
acfa2b9422
llm: prevent race appending to slice ( #3320 )
2024-03-24 11:35:54 -07:00
Daniel Hiltgen
3e30c75f3e
Bump llama.cpp to b2510
2024-03-23 19:55:56 +01:00
Daniel Hiltgen
43799532c1
Bump llama.cpp to b2474
...
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Daniel Hiltgen
74788b487c
Better tmpdir cleanup
...
If expanding the runners fails, don't leave a corrupt/incomplete payloads dir
We now write a pid file out to the tmpdir, which allows us to scan for stale tmpdirs
and remove this as long as there isn't still a process running.
2024-03-20 16:03:19 +01:00
Michael Yang
3c4ad0ecab
dyn global
2024-03-18 09:45:45 +01:00
Michael Yang
22f326464e
Merge pull request #3083 from ollama/mxyng/refactor-readseeker
...
refactor readseeker
2024-03-16 12:08:56 -07:00
Jeffrey Morgan
e95ffc7448
llama: remove server static assets ( #3174 )
2024-03-15 19:24:12 -07:00
Daniel Hiltgen
ab3456207b
Merge pull request #3028 from ollama/ci_release
...
CI release process
2024-03-15 16:40:54 -07:00
Daniel Hiltgen
6ad414f31e
Merge pull request #3086 from dhiltgen/import_server
...
Import server.cpp to retain llava support
2024-03-15 16:10:35 -07:00
Daniel Hiltgen
d4c10df2b0
Add Radeon gfx940-942 GPU support
2024-03-15 15:34:58 -07:00
Daniel Hiltgen
540f4af45f
Wire up more complete CI for releases
...
Flesh out our github actions CI so we can build official releases.
2024-03-15 12:37:36 -07:00
Blake Mizerany
6ce37e4d96
llm,readline: use errors.Is instead of simple == check ( #3161 )
...
This fixes some brittle, simple equality checks to use errors.Is. Since
go1.13, errors.Is is the idiomatic way to check for errors.
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2024-03-15 07:14:12 -07:00
Michael Yang
291c663865
fix: clip memory leak
2024-03-14 13:12:42 -07:00
Jeffrey Morgan
e72c567cfd
restore locale patch ( #3091 )
2024-03-12 22:08:13 -07:00
Bruce MacDonald
3e22611200
token repeat limit for prediction requests ( #3080 )
2024-03-12 22:08:25 -04:00
Bruce MacDonald
2f804068bd
warn when json format is expected but not mentioned in prompt ( #3081 )
2024-03-12 19:07:11 -04:00
Daniel Hiltgen
85129d3a32
Adapt our build for imported server.cpp
2024-03-12 14:57:15 -07:00
Daniel Hiltgen
9ac6440da3
Import server.cpp as of b2356
2024-03-12 13:58:06 -07:00
Michael Yang
0085297928
refactor readseeker
2024-03-12 12:54:18 -07:00
racerole
53c107e20e
chore: fix typo ( #3073 )
...
Signed-off-by: racerole <jiangyifeng@outlook.com>
2024-03-12 14:09:22 -04:00
Bruce MacDonald
b80661e8c7
relay load model errors to the client ( #3065 )
2024-03-11 16:48:27 -04:00
Jeffrey Morgan
369eda65f5
update llama.cpp submodule to `ceca1ae` ( #3064 )
2024-03-11 12:57:48 -07:00
Daniel Hiltgen
bc13da2bfe
Avoid rocm runner and dependency clash
...
Putting the rocm symlink next to the runners is risky. This moves
the payloads into a subdir to avoid potential clashes.
2024-03-11 09:33:22 -07:00
Jeffrey Morgan
41b00b9856
fix 03-locale.diff
2024-03-10 16:21:05 -07:00
Daniel Hiltgen
3dc1bb6a35
Harden for deps file being empty (or short)
2024-03-10 14:45:38 -07:00
Jeffrey Morgan
908005d90b
patch: use default locale in wpm tokenizer ( #3034 )
2024-03-09 21:12:12 -08:00
Jeffrey Morgan
e11668aa07
add `bundle_metal` and `cleanup_metal` functions to `gen_darwin.sh`
2024-03-09 16:04:57 -08:00
Jeffrey Morgan
1ffb1e2874
update llama.cpp submodule to `77d1ac7` ( #3030 )
2024-03-09 15:55:34 -08:00
Jeffrey Morgan
f9cd55c70b
disable gpu for certain model architectures and fix divide-by-zero on memory estimation
2024-03-09 12:51:38 -08:00
Daniel Hiltgen
4a5c9b8035
Finish unwinding idempotent payload logic
...
The recent ROCm change partially removed idempotent
payloads, but the ggml-metal.metal file for mac was still
idempotent. This finishes switching to always extract
the payloads, and now that idempotency is gone, the
version directory is no longer useful.
2024-03-09 08:34:39 -08:00
Jeffrey Morgan
efe5617b64
update llama.cpp submodule to `c2101a2` ( #3020 )
2024-03-09 00:44:50 -08:00
Michael Yang
76bdebbadf
decode ggla
2024-03-08 15:46:25 -08:00
Jeffrey Morgan
0e4669b04f
update llama.cpp submodule to `6cdabe6` ( #2999 )
2024-03-08 00:26:20 -08:00
Daniel Hiltgen
6c5ccb11f9
Revamp ROCm support
...
This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var, that defaults to `~/.ollama`. The logic was already
idempotent, so this should speed up startups after the first time a
new release is deployed. It also cleans up after itself.
We now build only a single ROCm version (latest major) on both windows
and linux. Given the large size of ROCm's tensor files, we split the
dependency out. It's bundled into the installer on windows, and a
separate download on linux. The linux install script is now smart and
detects the presence of AMD GPUs and looks to see if rocm v6 is already
present, and if not, then downloads our dependency tar file.
For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.
2024-03-07 10:36:50 -08:00
John
23ebe8fe11
fix some typos ( #2973 )
...
Signed-off-by: hishope <csqiye@126.com>
2024-03-06 22:50:11 -08:00
Patrick Devine
2c017ca441
Convert Safetensors to an Ollama model ( #2824 )
2024-03-06 21:01:51 -08:00
Jeffrey Morgan
21347e1ed6
update llama.cpp submodule to `c29af7e` ( #2868 )
2024-03-01 15:26:04 -08:00
Daniel Hiltgen
bd1d8b0d14
Merge pull request #2836 from bmwiedemann/gzip
...
Omit build date from gzip headers
2024-02-29 15:46:46 -08:00
Jeffrey Morgan
cbf4970e0f
bump submodule to `87c91c07663b707e831c59ec373b5e665ff9d64a` ( #2828 )
2024-02-29 09:42:08 -08:00
Bernhard M. Wiedemann
76e5d9ec88
Omit build date from gzip headers
...
See https://reproducible-builds.org/ for why this is good.
This patch was done while working on reproducible builds for openSUSE.
2024-02-29 16:48:19 +01:00
Daniel Hiltgen
061e8f6abc
Bump llama.cpp to b2276
2024-02-26 16:49:24 -08:00
Jeffrey Morgan
11bfff8ee1
update llama.cpp submodule to 96633eeca1265ed03e57230de54032041c58f9cd
2024-02-22 16:44:26 -05:00
Jeffrey Morgan
efe040f8c0
reset with `init_vars` ahead of each cpu build in `gen_windows.ps1` ( #2654 )
2024-02-21 16:35:34 -05:00
Jeffrey Morgan
2a7553ce09
update llama.cpp submodule to c14f72d
2024-02-21 09:03:14 -05:00
Jeffrey Morgan
b3eac61cac
update llama.cpp submodule to f0d1fafc029a056cd765bdae58dcaa12312e9879
2024-02-20 22:56:51 -05:00
Michael Yang
949d7b1c48
add gguf file types ( #2532 )
2024-02-20 19:06:29 -05:00
Jeffrey Morgan
4613a080e7
update llama.cpp submodule to `66c1968f7` ( #2618 )
2024-02-20 17:42:31 -05:00
Taras Tsugrii
01ff2e14db
[nit] Remove unused msg local var. ( #2511 )
2024-02-20 14:02:34 -05:00
Daniel Hiltgen
4fcbf1cde6
Merge pull request #2599 from dhiltgen/fix_avx
...
Explicitly disable AVX2 on GPU builds
2024-02-19 13:13:05 -08:00
Daniel Hiltgen
9220b4fa91
Merge pull request #2585 from dhiltgen/cuda_leaks
...
Fix cuda leaks
2024-02-19 12:48:00 -08:00
Daniel Hiltgen
fc39a6cd7a
Fix cuda leaks
...
This should resolve the problem where we don't fully unload from the GPU
when we go idle.
2024-02-18 18:37:20 -08:00
Daniel Hiltgen
df6dc4fd96
Fix duplicate menus on update and exit on signals
...
Also fixes a few fit-and-finish items for better developer experience
2024-02-16 15:33:16 -08:00
Daniel Hiltgen
db2a9ad1fe
Explicitly disable AVX2 on GPU builds
...
Even though we weren't setting it to on, somewhere in the cmake config
it was getting toggled on. By explicitly setting it to off, we get `/arch:AVX`
as intended.
2024-02-15 14:50:11 -08:00
Daniel Hiltgen
29e90cc13b
Implement new Go based Desktop app
...
This focuses on Windows first, but could be used for Mac
and possibly linux in the future.
2024-02-15 05:56:45 +00:00
Jeffrey Morgan
9241a29336
Revert "Revert "bump submodule to `6c00a06` ( #2479 )"" ( #2485 )
...
This reverts commit 6920964b87.
2024-02-13 18:18:41 -08:00
Jeffrey Morgan
f7231ad9ad
set `shutting_down` to `false` once shutdown is complete ( #2484 )
2024-02-13 17:48:41 -08:00
Jeffrey Morgan
6920964b87
Revert "bump submodule to `6c00a06` ( #2479 )"
...
This reverts commit 2f9ed52bbd.
2024-02-13 17:23:05 -08:00
Jeffrey Morgan
2f9ed52bbd
bump submodule to `6c00a06` ( #2479 )
2024-02-13 17:12:42 -08:00
Daniel Hiltgen
939c60473f
Merge pull request #2422 from dhiltgen/better_kill
...
More robust shutdown
2024-02-12 14:05:06 -08:00
Jeffrey Morgan
f76ca04f9e
update submodule to `099afc6` ( #2468 )
2024-02-12 14:01:16 -08:00
Daniel Hiltgen
76b8728f0c
Merge pull request #2465 from dhiltgen/block_rocm_pre_9
...
Detect AMD GPU info via sysfs and block old cards
2024-02-12 12:41:43 -08:00
Daniel Hiltgen
6d84f07505
Detect AMD GPU info via sysfs and block old cards
...
This wires up some new logic to start using sysfs to discover AMD GPU
information and detects old cards we can't yet support so we can fall back to CPU mode.
2024-02-12 08:19:41 -08:00
Jeffrey Morgan
26b13fc33c
patch: always add token to cache_tokens ( #2459 )
2024-02-12 08:10:16 -08:00
Daniel Hiltgen
6680761596
Shutdown faster
...
Make sure that when a shutdown signal comes, we shutdown quickly instead
of waiting for a potentially long exchange to wrap up.
2024-02-08 22:22:50 -08:00
Daniel Hiltgen
a1dfab43b9
Ensure the libraries are present
...
When we store our libraries in a temp dir, a reaper might clean
them when we are idle, so make sure to check for them before
we reload.
2024-02-07 17:27:49 -08:00
Daniel Hiltgen
de76b95dd4
Bump llama.cpp to b2081
2024-02-06 12:06:43 -08:00
Daniel Hiltgen
27aa2d4a19
Merge pull request #1849 from mraiser/main
...
Accommodate split cuda lib dir
2024-02-05 16:01:16 -08:00
Daniel Hiltgen
e1f50377f4
Harden generate patching model
...
Only apply patches if we have any, and make sure to cleanup
every file we patched at the end to leave the tree clean
2024-02-01 19:34:36 -08:00
Jeffrey Morgan
f11bf0740b
use llm.ImageData
2024-01-31 19:13:48 -08:00
Michael Yang
8450bf66e6
trim images
2024-01-31 19:13:47 -08:00
Daniel Hiltgen
72b12c3be7
Bump llama.cpp to b1999
...
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
Jeffrey Morgan
2e06ed01d5
remove unknown `CPPFLAGS` option
2024-01-28 17:51:23 -08:00
mraiser
4c4c730a0a
Merge branch 'ollama:main' into main
2024-01-27 21:56:11 -05:00
Daniel Hiltgen
e02ecfb6c8
Merge pull request #2116 from dhiltgen/cc_50_80
...
Add support for CUDA 5.0 cards
2024-01-27 10:28:38 -08:00
Jeffrey Morgan
3ebd6a83fc
update submodule to cd4fddb29f81d6a1f6d51a0c016bc6b486d68def
2024-01-25 13:54:11 -08:00
Jeffrey Morgan
a64570dcae
Fix clearing kv cache between requests with the same prompt ( #2186 )
...
* Fix clearing kv cache between requests with the same prompt
* fix powershell script
2024-01-25 13:46:20 -08:00
mraiser
a4564232a4
Update gen_linux.sh to find libcudart in separate directory
2024-01-25 09:49:35 -05:00
Michael Yang
cd22855ef8
refactor tensor read
2024-01-24 10:48:31 -08:00
Jeffrey Morgan
4458efb73a
Load all layers on `arm64` macOS if model is small enough ( #2149 )
2024-01-22 17:40:06 -08:00
Daniel Hiltgen
0f5b843319
Refine Accelerate usage on mac
...
For old macs, accelerate seems to cause crashes, but for
AVX2 capable macs, it does not.
2024-01-22 16:25:56 -08:00
Jeffrey Morgan
ffaf52e1e9
update submodule to 011e8ec577fd135cbc02993d3ea9840c516d6a1c
2024-01-22 15:16:54 -08:00
Daniel Hiltgen
3bc28736cd
Merge pull request #2143 from dhiltgen/llm_verbosity
...
Refine debug logging for llm
2024-01-22 13:19:16 -08:00
Daniel Hiltgen
730dcfcc7a
Refine debug logging for llm
...
This wires up logging in llama.cpp to always go to stderr, and also
turns up logging if OLLAMA_DEBUG is set.
2024-01-22 12:26:49 -08:00
Daniel Hiltgen
27a2d5af54
Debug logging on init failure
2024-01-22 12:08:22 -08:00
Jeffrey Morgan
5f81a33f43
update submodule to `6f9939d` ( #2115 )
2024-01-22 11:56:40 -08:00
Daniel Hiltgen
5576bb2348
Merge pull request #2130 from dhiltgen/more_faster
...
Make CPU builds parallel and customizable AMD GPUs
2024-01-21 16:14:12 -08:00
Daniel Hiltgen
ec3764538d
Probe GPUs before backend init
...
Detect potential error scenarios so we can fall back to CPU mode without
hitting asserts.
2024-01-21 15:59:38 -08:00
Daniel Hiltgen
df54c723ae
Make CPU builds parallel and customizable AMD GPUs
...
The linux build now supports parallel CPU builds to speed things up.
This also exposes AMD GPU targets as an optional setting for advanced
users who want to alter our default set.
2024-01-21 15:12:21 -08:00
Jeffrey Morgan
89c4aee29e
Unlock mutex when failing to load model ( #2117 )
2024-01-20 20:54:46 -05:00
Daniel Hiltgen
a447a083f2
Add compute capability 5.0, 7.5, and 8.0
2024-01-20 14:24:05 -08:00
Daniel Hiltgen
681a914990
Add support for CUDA 5.2 cards
2024-01-20 10:48:43 -08:00
Jeffrey Morgan
4c54f0ddeb
sign dylibs on macOS ( #2101 )
2024-01-19 19:24:11 -05:00
Daniel Hiltgen
6a042438af
Switch to local dlopen symbols
2024-01-19 11:37:02 -08:00
Jeffrey Morgan
dc88cc3981
use `gzip` for runner embedding ( #2067 )
2024-01-19 13:23:03 -05:00
Daniel Hiltgen
abec7f06e5
Merge pull request #2056 from dhiltgen/slog
...
Mechanical switch from log to slog
2024-01-18 14:27:24 -08:00
Daniel Hiltgen
fedd705aea
Mechanical switch from log to slog
...
A few obvious levels were adjusted, but generally everything mapped to "info" level.
2024-01-18 14:12:57 -08:00
Daniel Hiltgen
fccdf4c635
Merge pull request #1987 from xyproto/archlinux
...
Let gpu.go and gen_linux.sh also find CUDA on Arch Linux
2024-01-18 13:32:10 -08:00
Daniel Hiltgen
1b249748ab
Add multiple CPU variants for Intel Mac
...
This also refines the build process for the ext_server build.
2024-01-17 15:08:54 -08:00
Alexander F. Rødseth
cbe2adc78a
Merge branch 'main' into archlinux
2024-01-17 12:50:11 +01:00
Daniel Hiltgen
795674dd90
Bump llama.cpp to b1842 and add new cuda lib dep
...
Upstream llama.cpp has added a new dependency with the
NVIDIA CUDA Driver Libraries (libcuda.so) which is part of the
driver distribution, not the general cuda libraries, and is not
available as an archive, so we can not statically link it. This may
introduce some additional compatibility challenges which we'll
need to keep an eye on.
2024-01-16 12:53:52 -08:00
Bruce MacDonald
a897e833b8
do not cache prompt ( #2018 )
...
- prompt cache causes inference to hang after some time
2024-01-16 13:48:05 -05:00
Daniel Hiltgen
8795447dad
Merge pull request #1966 from fpreiss/fpreiss/gen_linux_cuda_detection
...
improve cuda detection (rel. issue #1704 )
2024-01-14 18:00:11 -08:00
Daniel Hiltgen
95ad9a9fc8
Merge pull request #1988 from dhiltgen/fix_intel_mac
...
Fix typo in arm mac arch script
2024-01-14 08:45:18 -08:00
Daniel Hiltgen
3ca5f69ce8
Fix typo in arm mac arch script
2024-01-14 08:32:57 -08:00
Daniel Hiltgen
cfa6337960
Merge pull request #1982 from dhiltgen/fix_intel_mac
...
Fix intel mac build
2024-01-14 08:26:46 -08:00
Alexander F. Rødseth
f4bf1d514f
Let gpu.go and gen_linux.sh also find CUDA on Arch Linux
2024-01-14 13:40:36 +01:00
Jeffrey Morgan
557110d0ba
Disable `mmap` with lora layers ( #1985 )
2024-01-13 23:36:31 -05:00
Daniel Hiltgen
2ecb247276
Fix intel mac build
...
Make sure we're building an x86 ext_server lib when cross-compiling
2024-01-13 14:46:34 -08:00
Jeffrey Morgan
288ef8ff95
add gcc `-lstdc++` flag for linux cpu ( #1974 )
2024-01-13 03:53:00 -05:00