Commit graph

414 commits

Author SHA1 Message Date
Daniel Hiltgen
0b5c589ca2
Merge pull request #3966 from dhiltgen/bump
Fix target in gen_windows.ps1
2024-04-26 15:36:53 -07:00
Michael Yang
65fadddc85
Merge pull request #3964 from ollama/mxyng/weights
fix gemma, command-r layer weights
2024-04-26 15:23:33 -07:00
Daniel Hiltgen
ed5fb088c4 Fix target in gen_windows.ps1 2024-04-26 15:10:42 -07:00
Michael Yang
f81f308118 fix gemma, command-r layer weights 2024-04-26 15:00:55 -07:00
Jeffrey Morgan
bb31def011
return code 499 when user cancels request while a model is loading (#3955) 2024-04-26 17:38:29 -04:00
Daniel Hiltgen
5c0c2d1d09
Merge pull request #3954 from dhiltgen/ci_fixes
Put back non-avx CPU build for windows
2024-04-26 13:09:03 -07:00
Daniel Hiltgen
421c878a2d Put back non-avx CPU build for windows 2024-04-26 12:44:07 -07:00
Daniel Hiltgen
85801317d1 Fix clip log import 2024-04-26 09:43:46 -07:00
Daniel Hiltgen
2ed0d65948 Bump llama.cpp to b2737 2024-04-26 09:43:28 -07:00
Daniel Hiltgen
8671fdeda6 Refactor windows generate for more modular usage 2024-04-26 08:35:50 -07:00
Daniel Hiltgen
8feb97dc0d Move cuda/rocm dependency gathering into generate script
This will make it simpler for CI to accumulate artifacts from prior steps
2024-04-25 22:38:44 -07:00
Michael Yang
de4ded68b0
Merge pull request #3923 from ollama/mxyng/mem
only count output tensors
2024-04-25 16:34:17 -07:00
Daniel Hiltgen
9b5a3c5991
Merge pull request #3914 from dhiltgen/mac_perf
Improve mac parallel performance
2024-04-25 16:28:31 -07:00
Jeffrey Morgan
993cf8bf55
llm: limit generation to 10x context size to avoid run on generations (#3918)
* llm: limit generation to 10x context size to avoid run on generations

* add comment

* simplify condition statement
2024-04-25 19:02:30 -04:00
Michael Yang
7bb7cb8a60 only count output tensors 2024-04-25 15:24:08 -07:00
jmorganca
ddf5c09a9b use matrix multiplcation kernels in more cases 2024-04-25 13:58:54 -07:00
Roy Yang
5f73c08729
Remove trailing spaces (#3889) 2024-04-25 14:32:26 -04:00
Daniel Hiltgen
6e76348df7
Merge pull request #3834 from dhiltgen/not_found_in_path
Report errors on server lookup instead of path lookup failure
2024-04-24 10:50:48 -07:00
Patrick Devine
14476d48cc
fixes for gguf (#3863) 2024-04-23 20:57:20 -07:00
Daniel Hiltgen
5445aaa94e Add back memory escape valve
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround.  Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
Daniel Hiltgen
058f6cd2cc Move nested payloads to installer and zip file on windows
Now that the llm runner is an executable and not just a dll, more users are facing
problems with security policy configurations on windows that prevent users
writing to directories and then executing binaries from the same location.
This change removes payloads from the main executable on windows and shifts them
over to be packaged in the installer and discovered based on the executables location.
This also adds a new zip file for people who want to "roll their own" installation model.
2024-04-23 16:14:47 -07:00
Daniel Hiltgen
58888a74bc Detect and recover if runner removed
Tmp cleaners can nuke the file out from underneath us.  This detects the missing
runner, and re-initializes the payloads.
2024-04-23 10:05:26 -07:00
Daniel Hiltgen
cc5a71e0e3
Merge pull request #3709 from remy415/custom-gpu-defs
Adds support for customizing GPU build flags in llama.cpp
2024-04-23 09:28:34 -07:00
Michael Yang
e83bcf7f9a
Merge pull request #3836 from ollama/mxyng/mixtral
fix: mixtral graph
2024-04-23 09:15:10 -07:00
Daniel Hiltgen
34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
Daniel Hiltgen
8711d03df7 Report errors on server lookup instead of path lookup failure 2024-04-22 19:08:47 -07:00
Michael Yang
435cc866a3 fix: mixtral graph 2024-04-22 17:19:44 -07:00
Daniel Hiltgen
aa72281eae Trim spaces and quotes from llm lib override 2024-04-22 17:11:14 -07:00
Jeremy
9c0db4cc83
Update gen_windows.ps1
Fixed improper env references
2024-04-21 16:13:41 -04:00
Cheng
62be2050dd
chore: use errors.New to replace fmt.Errorf will much better (#3789) 2024-04-20 22:11:06 -04:00
Jeremy
6f18297b3a
Update gen_windows.ps1
Forgot a " on the write-host
2024-04-18 19:47:44 -04:00
Jeremy
15016413de
Update gen_windows.ps1
Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS to customize GPU builds on Windows
2024-04-18 19:27:16 -04:00
Jeremy
440b7190ed
Update gen_linux.sh
Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS instead of OLLAMA_CUSTOM_GPU_DEFS
2024-04-18 19:18:10 -04:00
Jeremy
3934c15895
Merge branch 'ollama:main' into custom-gpu-defs 2024-04-18 09:55:10 -04:00
Jeremy
fd048f1367
Merge branch 'ollama:main' into arm64static 2024-04-18 09:55:04 -04:00
Michael Yang
8645076a71
Merge pull request #3712 from ollama/mxyng/mem
add stablelm graph calculation
2024-04-17 15:57:51 -07:00
Michael Yang
05e9424824
Merge pull request #3664 from ollama/mxyng/fix-padding-2
fix padding to only return padding
2024-04-17 15:57:40 -07:00
Michael Yang
3cf483fe48 add stablelm graph calculation 2024-04-17 13:57:19 -07:00
Jeremy
52f5370c48 add support for custom gpu build flags for llama.cpp 2024-04-17 16:00:48 -04:00
Jeremy
7c000ec3ed adds support for OLLAMA_CUSTOM_GPU_DEFS to customize GPU build flags 2024-04-17 15:21:05 -04:00
Jeremy
ea4c284a48
Merge branch 'ollama:main' into arm64static 2024-04-17 15:11:38 -04:00
Jeremy
8aec92fa6d rearranged conditional logic for static build, dockerfile updated 2024-04-17 14:43:28 -04:00
Michael Yang
a8b9b930b4 account for all non-repeating layers 2024-04-17 11:21:21 -07:00
Jeremy
70261b9bb6 move static build to its own flag 2024-04-17 13:04:28 -04:00
Michael Yang
e74163af4c fix padding to only return padding 2024-04-16 15:43:26 -07:00
Michael Yang
26df674785 scale graph based on gpu count 2024-04-16 14:44:13 -07:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path (#3681)
* parse wide argv characters on windows

* cleanup

* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Michael Yang
41a272de9f darwin: no partial offloading if required memory greater than system 2024-04-16 11:22:38 -07:00
Jeffrey Morgan
f335722275
update llama.cpp submodule to 7593639 (#3665) 2024-04-15 23:04:43 -04:00
Michael Yang
6d53b67c2c
Merge pull request #3663 from ollama/mxyng/fix-padding 2024-04-15 17:44:54 -07:00