Daniel Hiltgen
de76b95dd4
Bump llama.cpp to b2081
2024-02-06 12:06:43 -08:00
Daniel Hiltgen
27aa2d4a19
Merge pull request #1849 from mraiser/main
...
Accomodate split cuda lib dir
2024-02-05 16:01:16 -08:00
Daniel Hiltgen
e1f50377f4
Harden generate patching model
...
Only apply patches if we have any, and make sure to cleanup
every file we patched at the end to leave the tree clean
2024-02-01 19:34:36 -08:00
Jeffrey Morgan
f11bf0740b
use llm.ImageData
2024-01-31 19:13:48 -08:00
Michael Yang
8450bf66e6
trim images
2024-01-31 19:13:47 -08:00
Daniel Hiltgen
72b12c3be7
Bump llama.cpp to b1999
...
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
Jeffrey Morgan
2e06ed01d5
remove unknown CPPFLAGS
option
2024-01-28 17:51:23 -08:00
mraiser
4c4c730a0a
Merge branch 'ollama:main' into main
2024-01-27 21:56:11 -05:00
Daniel Hiltgen
e02ecfb6c8
Merge pull request #2116 from dhiltgen/cc_50_80
...
Add support for CUDA 5.0 cards
2024-01-27 10:28:38 -08:00
Jeffrey Morgan
3ebd6a83fc
update submodule to cd4fddb29f81d6a1f6d51a0c016bc6b486d68def
2024-01-25 13:54:11 -08:00
Jeffrey Morgan
a64570dcae
Fix clearing kv cache between requests with the same prompt ( #2186 )
...
* Fix clearing kv cache between requests with the same prompt
* fix powershell script
2024-01-25 13:46:20 -08:00
mraiser
a4564232a4
Update gen_linux.sh to find libcudart in separate directory
2024-01-25 09:49:35 -05:00
Michael Yang
cd22855ef8
refactor tensor read
2024-01-24 10:48:31 -08:00
Jeffrey Morgan
4458efb73a
Load all layers on arm64
macOS if model is small enough ( #2149 )
2024-01-22 17:40:06 -08:00
Daniel Hiltgen
0f5b843319
Refine Accelerate usage on mac
...
For old macs, accelerate seems to cause crashes, but for
AVX2 capable macs, it does not.
2024-01-22 16:25:56 -08:00
Jeffrey Morgan
ffaf52e1e9
update submodule to 011e8ec577fd135cbc02993d3ea9840c516d6a1c
2024-01-22 15:16:54 -08:00
Daniel Hiltgen
3bc28736cd
Merge pull request #2143 from dhiltgen/llm_verbosity
...
Refine debug logging for llm
2024-01-22 13:19:16 -08:00
Daniel Hiltgen
730dcfcc7a
Refine debug logging for llm
...
This wires up logging in llama.cpp to always go to stderr, and also
turns up logging if OLLAMA_DEBUG is set.
2024-01-22 12:26:49 -08:00
Daniel Hiltgen
27a2d5af54
Debug logging on init failure
2024-01-22 12:08:22 -08:00
Jeffrey Morgan
5f81a33f43
update submodule to 6f9939d
( #2115 )
2024-01-22 11:56:40 -08:00
Daniel Hiltgen
5576bb2348
Merge pull request #2130 from dhiltgen/more_faster
...
Make CPU builds parallel and customizable AMD GPUs
2024-01-21 16:14:12 -08:00
Daniel Hiltgen
ec3764538d
Probe GPUs before backend init
...
Detect potential error scenarios so we can fallback to CPU mode without
hitting asserts.
2024-01-21 15:59:38 -08:00
Daniel Hiltgen
df54c723ae
Make CPU builds parallel and customizable AMD GPUs
...
The linux build now support parallel CPU builds to speed things up.
This also exposes AMD GPU targets as an optional setting for advaced
users who want to alter our default set.
2024-01-21 15:12:21 -08:00
Jeffrey Morgan
89c4aee29e
Unlock mutex when failing to load model ( #2117 )
2024-01-20 20:54:46 -05:00
Daniel Hiltgen
a447a083f2
Add compute capability 5.0, 7.5, and 8.0
2024-01-20 14:24:05 -08:00
Daniel Hiltgen
681a914990
Add support for CUDA 5.2 cards
2024-01-20 10:48:43 -08:00
Jeffrey Morgan
4c54f0ddeb
sign dylibs on macOS ( #2101 )
2024-01-19 19:24:11 -05:00
Daniel Hiltgen
6a042438af
Switch to local dlopen symbols
2024-01-19 11:37:02 -08:00
Jeffrey Morgan
dc88cc3981
use gzip
for runner embedding ( #2067 )
2024-01-19 13:23:03 -05:00
Daniel Hiltgen
abec7f06e5
Merge pull request #2056 from dhiltgen/slog
...
Mechanical switch from log to slog
2024-01-18 14:27:24 -08:00
Daniel Hiltgen
fedd705aea
Mechanical switch from log to slog
...
A few obvious levels were adjusted, but generally everything mapped to "info" level.
2024-01-18 14:12:57 -08:00
Daniel Hiltgen
fccdf4c635
Merge pull request #1987 from xyproto/archlinux
...
Let gpu.go and gen_linux.sh also find CUDA on Arch Linux
2024-01-18 13:32:10 -08:00
Daniel Hiltgen
1b249748ab
Add multiple CPU variants for Intel Mac
...
This also refines the build process for the ext_server build.
2024-01-17 15:08:54 -08:00
Alexander F. Rødseth
cbe2adc78a
Merge branch 'main' into archlinux
2024-01-17 12:50:11 +01:00
Daniel Hiltgen
795674dd90
Bump llama.cpp to b1842 and add new cuda lib dep
...
Upstream llama.cpp has added a new dependency with the
NVIDIA CUDA Driver Libraries (libcuda.so) which is part of the
driver distribution, not the general cuda libraries, and is not
available as an archive, so we can not statically link it. This may
introduce some additional compatibility challenges which we'll
need to keep an eye on.
2024-01-16 12:53:52 -08:00
Bruce MacDonald
a897e833b8
do not cache prompt ( #2018 )
...
- prompt cache causes inferance to hang after some time
2024-01-16 13:48:05 -05:00
Daniel Hiltgen
8795447dad
Merge pull request #1966 from fpreiss/fpreiss/gen_linux_cuda_detection
...
improve cuda detection (rel. issue #1704 )
2024-01-14 18:00:11 -08:00
Daniel Hiltgen
95ad9a9fc8
Merge pull request #1988 from dhiltgen/fix_intel_mac
...
Fix typo in arm mac arch script
2024-01-14 08:45:18 -08:00
Daniel Hiltgen
3ca5f69ce8
Fix typo in arm mac arch script
2024-01-14 08:32:57 -08:00
Daniel Hiltgen
cfa6337960
Merge pull request #1982 from dhiltgen/fix_intel_mac
...
Fix intel mac build
2024-01-14 08:26:46 -08:00
Alexander F. Rødseth
f4bf1d514f
Let gpu.go and gen_linux.sh also find CUDA on Arch Linux
2024-01-14 13:40:36 +01:00
Jeffrey Morgan
557110d0ba
Disable mmap
with lora layers ( #1985 )
2024-01-13 23:36:31 -05:00
Daniel Hiltgen
2ecb247276
Fix intel mac build
...
Make sure we're building an x86 ext_server lib when cross-compiling
2024-01-13 14:46:34 -08:00
Jeffrey Morgan
288ef8ff95
add gcc -lstdc++
flag for linux cpu ( #1974 )
2024-01-13 03:53:00 -05:00
Jeffrey Morgan
4cf17990f7
use g++ to build libext_server.so
on linux ( #1972 )
2024-01-13 03:12:42 -05:00
Michael Yang
eaed6f8c45
add max context length check
2024-01-12 14:54:07 -08:00
Fabian Preiss
905862e17b
improve cuda detection (rel. issue #1704 )
2024-01-12 21:59:19 +01:00
Daniel Hiltgen
3773fb6465
Merge pull request #1935 from dhiltgen/cpu_fallback
...
Fix up the CPU fallback selection
2024-01-11 15:52:32 -08:00
Daniel Hiltgen
7427fa1387
Fix up the CPU fallback selection
...
The memory changes and multi-variant change had some merge
glitches I missed. This fixes them so we actually get the cpu llm lib
and best variant for the given system.
2024-01-11 15:27:06 -08:00
Michael Yang
d2be6387c9
fix typo
2024-01-11 14:25:21 -08:00
Michael Yang
d7af35d3d0
import fmt
2024-01-11 14:22:32 -08:00
Michael Yang
defc1dbd6e
use x/exp/slices
2024-01-11 14:20:13 -08:00
Daniel Hiltgen
de2fbdec99
Merge pull request #1819 from dhiltgen/multi_variant
...
Support multiple LLM libs; ROCm v5 and v6; Rosetta, AVX, and AVX2 compatible CPU builds
2024-01-11 14:00:48 -08:00
Michael Yang
f4f939de28
Merge pull request #1552 from jmorganca/mxyng/lint-test
...
add lint and test on pull_request
2024-01-11 09:37:45 -08:00
Daniel Hiltgen
39928a42e8
Always dynamically load the llm server library
...
This switches darwin to dynamic loading, and refactors the code now that no
static linking of the library is used on any platform
2024-01-11 08:42:47 -08:00
Daniel Hiltgen
d88c527be3
Build multiple CPU variants and pick the best
...
This reduces the built-in linux version to not use any vector extensions
which enables the resulting builds to run under Rosetta on MacOS in
Docker. Then at runtime it checks for the actual CPU vector
extensions and loads the best CPU library available
2024-01-11 08:42:47 -08:00
Jeffrey Morgan
ab6be852c7
revisit memory allocation to account for full kv cache on main gpu
2024-01-11 01:45:31 -05:00
Daniel Hiltgen
8da7bef05f
Support multiple variants for a given llm lib type
...
In some cases we may want multiple variants for a given GPU type or CPU.
This adds logic to have an optional Variant which we can use to select
an optimal library, but also allows us to try multiple variants in case
some fail to load.
This can be useful for scenarios such as ROCm v5 vs v6 incompatibility
or potentially CPU features.
2024-01-10 17:27:51 -08:00
Jeffrey Morgan
b24e8d17b2
Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu ( #1896 )
...
* increase minimum cuda overhead and fix minimum overhead for multi-gpu
* fix multi gpu overhead
* limit overhead to 10% of all gpus
* better wording
* allocate fixed amount before layers
* fixed only includes graph alloc
2024-01-10 19:08:51 -05:00
Jeffrey Morgan
f83881390f
revert submodule back to 328b83de23b33240e28f4e74900d1d06726f5eb1
2024-01-10 18:42:39 -05:00
Jeffrey Morgan
224fbf2795
update submodule to commit 1fc2f265ff9377a37fd2c61eae9cd813a3491bea
until its main branch is fixed
2024-01-10 17:03:15 -05:00
Jeffrey Morgan
2c6e8f5248
Update submodule to 6efb8eb30e7025b168f3fda3ff83b9b386428ad6
( #1885 )
...
* update submodule to `6efb8eb30e7025b168f3fda3ff83b9b386428ad6`
* unblock condition variable in `update_slots` when closing server
2024-01-10 16:48:38 -05:00
Jeffrey Morgan
34344d801c
clean up cmake build
directory when cross compiling macOS builds
2024-01-09 17:13:56 -05:00
Jeffrey Morgan
8a8c7e7f8d
only build for metal on arm64
2024-01-09 13:51:08 -05:00
Michael Yang
f921e2696e
typo
2024-01-09 09:45:42 -08:00
Michael Yang
4a33cede20
remove unused fields and functions
2024-01-09 09:37:40 -08:00
Michael Yang
2bb2bdd5d4
fix lint
2024-01-09 09:36:58 -08:00
Jeffrey Morgan
f387e9631b
use runner if cuda alloc won't fit
2024-01-09 00:44:34 -05:00
Jeffrey Morgan
cb534e6ac2
use 10% vram overhead for cuda
2024-01-08 23:17:44 -05:00
Jeffrey Morgan
58ce2d8273
better estimate scratch buffer size
2024-01-08 21:32:44 -05:00
Jeffrey Morgan
18ddf6d57d
fix windows build
2024-01-08 20:04:01 -05:00
Jeffrey Morgan
08f1e18965
Offload layers to GPU based on new model size estimates ( #1850 )
...
* select layers based on estimated model memory usage
* always account for scratch vram
* dont load +1 layers
* better estmation for graph alloc
* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
* add overhead for cuda memory
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* fix build error on linux
* address comments
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2024-01-08 16:42:00 -05:00
Jeffrey Morgan
5feec959ad
dont use -Wall
in static build ( #1833 )
2024-01-07 10:39:19 -05:00
Jeffrey Morgan
dbdd50b283
add -DCMAKE_SYSTEM_NAME=Darwin
cmake flag ( #1832 )
2024-01-07 00:46:17 -05:00
Bruce MacDonald
3367b5f3df
remove unused generate patches ( #1810 )
2024-01-05 11:25:45 -05:00
Daniel Hiltgen
9983fa5f4e
Cleaup stale submodule
...
If the tree has a stale submodule, make sure we clean it up first
2024-01-04 13:40:16 -08:00
Daniel Hiltgen
fac9060da5
Init submodule with new path
2024-01-04 13:00:13 -08:00
Daniel Hiltgen
77d96da94b
Code shuffle to clean up the llm dir
2024-01-04 12:12:05 -08:00
Daniel Hiltgen
e9ce91e9a6
Load dynamic cpu lib on windows
...
On linux, we link the CPU library in to the Go app and fall back to it
when no GPU match is found. On windows we do not link in the CPU library
so that we can better control our dependencies for the CLI. This fixes
the logic so we correctly fallback to the dynamic CPU library
on windows.
2024-01-04 08:41:41 -08:00
Jeffrey Morgan
c0285158a9
tweak memory requirements error text
2024-01-03 19:47:18 -05:00
Jeffrey Morgan
77a66df72c
add macOS memory check for 47B models
2024-01-03 19:46:16 -05:00
Jeffrey Morgan
5b4837f881
remove unused filetype check
2024-01-03 19:45:39 -05:00
Jeffrey Morgan
29340c2e62
update cmake flags for amd64
macOS ( #1780 )
...
* update cmake flags for intel macOS
* remove `LLAMA_K_QUANTS`
* put back `CMAKE_OSX_DEPLOYMENT_TARGET` and disable `LLAMA_F16C`
2024-01-03 19:22:15 -05:00
Daniel Hiltgen
d5ec730354
Merge pull request #1779 from dhiltgen/refined_amd_gpu_list
...
Improve maintainability of Radeon card list
2024-01-03 16:18:57 -08:00
Daniel Hiltgen
ddbfa6fe31
Fix CPU only builds
...
Go embed doesn't like when there's no matching files, so put
a dummy placeholder in to allow building without any GPU support
If no "server" library is found, it's safely ignored at runtime.
2024-01-03 16:08:34 -08:00
Daniel Hiltgen
16f4603b67
Improve maintainability of Radeon card list
...
This moves the list of AMD GPUs to an easier to maintain list which
should make it easier to update over time.
2024-01-03 15:16:56 -08:00
Bruce MacDonald
0b3118e0af
fix: relay request opts to loaded llm prediction ( #1761 )
2024-01-03 12:01:42 -05:00
Daniel Hiltgen
0498f7ce56
Get rid of one-line llama.log
...
This one log line was triggering a single line llama.log to be generated
in the pwd of the server
2024-01-02 15:36:16 -08:00
Daniel Hiltgen
738a8d12eb
Rename the ollama cmakefile
2024-01-02 15:36:16 -08:00
Daniel Hiltgen
d966b730ac
Switch windows build to fully dynamic
...
Refactor where we store build outputs, and support a fully dynamic loading
model on windows so the base executable has no special dependencies thus
doesn't require a special PATH.
2024-01-02 15:36:16 -08:00
Daniel Hiltgen
9a70aecccb
Refactor how we augment llama.cpp
...
This changes the model for llama.cpp inclusion so we're not applying a patch,
but instead have the C++ code directly in the ollama tree, which should make it
easier to refine and update over time.
2024-01-02 15:35:55 -08:00
Jeffrey Morgan
d4ebdadbe7
enable cache_prompt
by default
2023-12-27 14:23:42 -05:00
K0IN
10da41d677
Add Cache flag to api ( #1642 )
2023-12-22 17:16:20 -05:00
Daniel Hiltgen
e5202eb687
Quiet down llama.cpp logging by default
...
By default builds will now produce non-debug and non-verbose binaries.
To enable verbose logs in llama.cpp and debug symbols in the
native code, set `CGO_CFLAGS=-g`
2023-12-22 08:47:18 -08:00
Daniel Hiltgen
fa24e73b82
Remove CPU build, fixup linux build script
2023-12-21 18:21:31 -08:00
Daniel Hiltgen
325d74985b
Fix CPU performance on hyperthreaded systems
...
The default thread count logic was broken and resulted in 2x the number
of threads as it should on a hyperthreading CPU
resulting in thrashing and poor performance.
2023-12-21 16:23:36 -08:00
Daniel Hiltgen
d9cd3d9667
Revive windows build
...
The windows native setup still needs some more work, but this gets it building
again and if you set the PATH properly, you can run the resulting exe on a cuda system.
2023-12-20 17:21:54 -08:00
Daniel Hiltgen
7555ea44f8
Revamp the dynamic library shim
...
This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.
This also bumps the ROCm library to version 6 given 5.7 builds don't work
on the latest ROCm library that just shipped.
2023-12-20 14:45:57 -08:00
Daniel Hiltgen
6558f94ed0
Fix darwin intel build
2023-12-19 13:32:24 -08:00
Daniel Hiltgen
54dbfa4c4a
Carry ggml-metal.metal as payload
2023-12-19 09:05:46 -08:00