ollama

Author	SHA1	Message	Date
Michael Yang	6d53b67c2c	Merge pull request #3663 from ollama/mxyng/fix-padding	2024-04-15 17:44:54 -07:00
Michael Yang	969238b19e	fix padding in decode. TODO: update padding() to _only_ return the padding	2024-04-15 17:27:06 -07:00
Patrick Devine	9f8691c6c8	Add llama2 / torch models for `ollama create` (#3607 )	2024-04-15 11:26:42 -07:00
Jeffrey Morgan	a0b8a32eb4	Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading (#3653 ) * terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading * use `unload` in signal handler	2024-04-15 12:09:32 -04:00
Jeffrey Morgan	309aef7fee	update llama.cpp submodule to `4bd0f93` (#3627 )	2024-04-13 10:43:02 -07:00
Michael Yang	3397eff0cd	mixtral mem	2024-04-11 11:10:41 -07:00
Michael Yang	7e33a017c0	partial offloading	2024-04-10 11:37:20 -07:00
Michael Yang	8b2c10061c	refactor tensor query	2024-04-10 11:37:20 -07:00
Daniel Hiltgen	c5ff443b9f	Handle very slow model loads During testing, we're seeing some models take over 3 minutes.	2024-04-09 16:35:10 -07:00
Blake Mizerany	1524f323a3	Revert "build.go: introduce a friendlier way to build Ollama (#3548 )" (#3564 )	2024-04-09 15:57:45 -07:00
Blake Mizerany	fccf3eecaa	build.go: introduce a friendlier way to build Ollama (#3548 ) This commit introduces a more friendly way to build Ollama dependencies and the binary without abusing `go generate` and removing the unnecessary extra steps it brings with it. This script also provides nicer feedback to the user about what is happening during the build process. At the end, it prints a helpful message to the user about what to do next (e.g. run the new local Ollama).	2024-04-09 14:18:47 -07:00
Michael Yang	c77d45d836	Merge pull request #3506 from ollama/mxyng/quantize-redux cgo quantize	2024-04-09 12:32:53 -07:00
Jeffrey Morgan	5ec12cec6c	update llama.cpp submodule to `1b67731` (#3561 )	2024-04-09 15:10:17 -04:00
Michael Yang	9502e5661f	cgo quantize	2024-04-08 15:31:08 -07:00
Jeffrey Morgan	63efa075a0	update generate scripts with new `LLAMA_CUDA` variable, set `HIP_PLATFORM` to avoid compiler errors (#3528 )	2024-04-07 19:29:51 -04:00
Michael Yang	be517e491c	no rope parameters	2024-04-05 18:05:27 -07:00
Michael Yang	fc8e108642	Merge pull request #3496 from ollama/mxyng/cmd-r-graph add command-r graph estimate	2024-04-05 12:26:21 -07:00
Daniel Hiltgen	dfe330fa1c	Merge pull request #3488 from mofanke/fix-windows-dll-compress fix dll compress in windows building	2024-04-04 16:12:13 -07:00
Michael Yang	01f77ae25d	add command-r graph estimate	2024-04-04 14:07:24 -07:00
Daniel Hiltgen	36bd967722	Fail fast if mingw missing on windows	2024-04-04 09:51:26 -07:00
mofanke	4de0126719	fix dll compress in windows building	2024-04-04 21:27:33 +08:00
Daniel Hiltgen	e4a7e5b2ca	Fix CI release glitches. The subprocess change moved the build directory; arm64 builds weren't setting cross-compilation flags when building on x86	2024-04-03 16:41:40 -07:00
Michael Yang	12e923e158	update graph size estimate	2024-04-03 13:34:12 -07:00
Jeffrey Morgan	cd135317d2	Fix macOS builds on older SDKs (#3467 )	2024-04-03 10:45:54 -07:00
Michael Yang	4f895d633f	Merge pull request #3466 from ollama/mxyng/head-kv default head_kv to 1	2024-04-03 10:41:00 -07:00
Daniel Hiltgen	464d817824	Merge pull request #3464 from dhiltgen/subprocess Fix numgpu opt miscomparison	2024-04-02 20:10:17 -07:00
Daniel Hiltgen	6589eb8a8c	Revert options as a ref in the server	2024-04-02 16:44:10 -07:00
Michael Yang	90f071c658	default head_kv to 1	2024-04-02 16:37:59 -07:00
Michael Yang	80163ebcb5	fix metal gpu	2024-04-02 16:06:45 -07:00
Daniel Hiltgen	0035e31af8	Bump to b2581	2024-04-02 11:53:07 -07:00
Daniel Hiltgen	0a0e9f3e0f	Apply 01-cache.diff	2024-04-01 16:48:18 -07:00
Daniel Hiltgen	58d95cc9bd	Switch back to subprocessing for llama.cpp This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shutdown when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.	2024-04-01 16:48:18 -07:00
Michael Yang	91b3e4d282	update memory calculations: count each layer independently when deciding gpu offloading	2024-04-01 13:16:32 -07:00
Michael Yang	d338d70492	refactor model parsing	2024-04-01 13:16:15 -07:00
Patrick Devine	5a5efee46b	Add gemma safetensors conversion (#3250 ) Co-authored-by: Michael Yang <mxyng@pm.me>	2024-03-28 18:54:01 -07:00
Jeffrey Morgan	f5ca7f8c8e	add license in file header for vendored llama.cpp code (#3351 )	2024-03-26 16:23:23 -04:00
Jeffrey Morgan	856b8ec131	remove need for `$VSINSTALLDIR` since build will fail if `ninja` cannot be found (#3350 )	2024-03-26 16:23:16 -04:00
Patrick Devine	1b272d5bcd	change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347 )	2024-03-26 13:04:17 -07:00
Daniel Hiltgen	8091ef2eeb	Bump llama.cpp to b2527	2024-03-25 13:47:44 -07:00
Daniel Hiltgen	560be5e0b6	Merge pull request #3308 from dhiltgen/bump_more Bump llama.cpp to b2510	2024-03-25 12:56:12 -07:00
Jeremy	dfc6721b20	add support for libcudart.so for CUDA devices (adds Jetson support)	2024-03-25 11:07:44 -04:00
Blake Mizerany	acfa2b9422	llm: prevent race appending to slice (#3320 )	2024-03-24 11:35:54 -07:00
Daniel Hiltgen	3e30c75f3e	Bump llama.cpp to b2510	2024-03-23 19:55:56 +01:00
Daniel Hiltgen	43799532c1	Bump llama.cpp to b2474 The release just before ggml-cuda.cu refactoring	2024-03-23 09:54:56 +01:00
Daniel Hiltgen	74788b487c	Better tmpdir cleanup If expanding the runners fails, don't leave a corrupt/incomplete payloads dir We now write a pid file out to the tmpdir, which allows us to scan for stale tmpdirs and remove this as long as there isn't still a process running.	2024-03-20 16:03:19 +01:00
Michael Yang	3c4ad0ecab	dyn global	2024-03-18 09:45:45 +01:00
Michael Yang	22f326464e	Merge pull request #3083 from ollama/mxyng/refactor-readseeker refactor readseeker	2024-03-16 12:08:56 -07:00
Jeffrey Morgan	e95ffc7448	llama: remove server static assets (#3174 )	2024-03-15 19:24:12 -07:00
Daniel Hiltgen	ab3456207b	Merge pull request #3028 from ollama/ci_release CI release process	2024-03-15 16:40:54 -07:00
Daniel Hiltgen	6ad414f31e	Merge pull request #3086 from dhiltgen/import_server Import server.cpp to retain llava support	2024-03-15 16:10:35 -07:00
Daniel Hiltgen	d4c10df2b0	Add Radeon gfx940-942 GPU support	2024-03-15 15:34:58 -07:00
Daniel Hiltgen	540f4af45f	Wire up more complete CI for releases. Flesh out our github actions CI so we can build official releases.	2024-03-15 12:37:36 -07:00
Blake Mizerany	6ce37e4d96	llm,readline: use errors.Is instead of simple == check (#3161 ) This fixes some brittle, simple equality checks to use errors.Is. Since go1.13, errors.Is is the idiomatic way to check for errors. Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2024-03-15 07:14:12 -07:00
Michael Yang	291c663865	fix: clip memory leak	2024-03-14 13:12:42 -07:00
Jeffrey Morgan	e72c567cfd	restore locale patch (#3091 )	2024-03-12 22:08:13 -07:00
Bruce MacDonald	3e22611200	token repeat limit for prediction requests (#3080 )	2024-03-12 22:08:25 -04:00
Bruce MacDonald	2f804068bd	warn when json format is expected but not mentioned in prompt (#3081 )	2024-03-12 19:07:11 -04:00
Daniel Hiltgen	85129d3a32	Adapt our build for imported server.cpp	2024-03-12 14:57:15 -07:00
Daniel Hiltgen	9ac6440da3	Import server.cpp as of b2356	2024-03-12 13:58:06 -07:00
Michael Yang	0085297928	refactor readseeker	2024-03-12 12:54:18 -07:00
racerole	53c107e20e	chore: fix typo (#3073 ) Signed-off-by: racerole <jiangyifeng@outlook.com>	2024-03-12 14:09:22 -04:00
Bruce MacDonald	b80661e8c7	relay load model errors to the client (#3065 )	2024-03-11 16:48:27 -04:00
Jeffrey Morgan	369eda65f5	update llama.cpp submodule to `ceca1ae` (#3064 )	2024-03-11 12:57:48 -07:00
Daniel Hiltgen	bc13da2bfe	Avoid rocm runner and dependency clash Putting the rocm symlink next to the runners is risky. This moves the payloads into a subdir to avoid potential clashes.	2024-03-11 09:33:22 -07:00
Jeffrey Morgan	41b00b9856	fix `03-locale.diff`	2024-03-10 16:21:05 -07:00
Daniel Hiltgen	3dc1bb6a35	Harden for deps file being empty (or short)	2024-03-10 14:45:38 -07:00
Jeffrey Morgan	908005d90b	patch: use default locale in wpm tokenizer (#3034 )	2024-03-09 21:12:12 -08:00
Jeffrey Morgan	e11668aa07	add `bundle_metal` and `cleanup_metal` functions to `gen_darwin.sh`	2024-03-09 16:04:57 -08:00
Jeffrey Morgan	1ffb1e2874	update llama.cpp submodule to `77d1ac7` (#3030 )	2024-03-09 15:55:34 -08:00
Jeffrey Morgan	f9cd55c70b	disable gpu for certain model architectures and fix divide-by-zero on memory estimation	2024-03-09 12:51:38 -08:00
Daniel Hiltgen	4a5c9b8035	Finish unwinding idempotent payload logic The recent ROCm change partially removed idempotent payloads, but the ggml-metal.metal file for mac was still idempotent. This finishes switching to always extract the payloads, and now that idempotency is gone, the version directory is no longer useful.	2024-03-09 08:34:39 -08:00
Jeffrey Morgan	efe5617b64	update llama.cpp submodule to `c2101a2` (#3020 )	2024-03-09 00:44:50 -08:00
Michael Yang	76bdebbadf	decode ggla	2024-03-08 15:46:25 -08:00
Jeffrey Morgan	0e4669b04f	update llama.cpp submodule to `6cdabe6` (#2999 )	2024-03-08 00:26:20 -08:00
Daniel Hiltgen	6c5ccb11f9	Revamp ROCm support This refines where we extract the LLM libraries to by adding a new OLLAMA_HOME env var that defaults to `~/.ollama`. The logic was already idempotent, so this should speed up startups after the first time a new release is deployed. It also cleans up after itself. We now build only a single ROCm version (latest major) on both windows and linux. Given the large size of ROCm's tensor files, we split the dependency out. It's bundled into the installer on windows, and a separate download on linux. The linux install script is now smart and detects the presence of AMD GPUs and looks to see if rocm v6 is already present, and if not, then downloads our dependency tar file. For Linux discovery, we now use sysfs and check each GPU against what ROCm supports so we can degrade to CPU gracefully instead of having llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows dynamic library loading logic to access the amdhip64.dll APIs to query the GPU information.	2024-03-07 10:36:50 -08:00
John	23ebe8fe11	fix some typos (#2973 ) Signed-off-by: hishope <csqiye@126.com>	2024-03-06 22:50:11 -08:00
Patrick Devine	2c017ca441	Convert Safetensors to an Ollama model (#2824 )	2024-03-06 21:01:51 -08:00
Jeffrey Morgan	21347e1ed6	update llama.cpp submodule to `c29af7e` (#2868 )	2024-03-01 15:26:04 -08:00
Daniel Hiltgen	bd1d8b0d14	Merge pull request #2836 from bmwiedemann/gzip Omit build date from gzip headers	2024-02-29 15:46:46 -08:00
Jeffrey Morgan	cbf4970e0f	bump submodule to `87c91c07663b707e831c59ec373b5e665ff9d64a` (#2828 )	2024-02-29 09:42:08 -08:00
Bernhard M. Wiedemann	76e5d9ec88	Omit build date from gzip headers See https://reproducible-builds.org/ for why this is good. This patch was done while working on reproducible builds for openSUSE.	2024-02-29 16:48:19 +01:00
Daniel Hiltgen	061e8f6abc	Bump llama.cpp to b2276	2024-02-26 16:49:24 -08:00
Jeffrey Morgan	11bfff8ee1	update llama.cpp submodule to `96633eeca1265ed03e57230de54032041c58f9cd`	2024-02-22 16:44:26 -05:00
Jeffrey Morgan	efe040f8c0	reset with `init_vars` ahead of each cpu build in `gen_windows.ps1` (#2654 )	2024-02-21 16:35:34 -05:00
Jeffrey Morgan	2a7553ce09	update llama.cpp submodule to `c14f72d`	2024-02-21 09:03:14 -05:00
Jeffrey Morgan	b3eac61cac	update llama.cpp submodule to `f0d1fafc029a056cd765bdae58dcaa12312e9879`	2024-02-20 22:56:51 -05:00
Michael Yang	949d7b1c48	add gguf file types (#2532 )	2024-02-20 19:06:29 -05:00
Jeffrey Morgan	4613a080e7	update llama.cpp submodule to `66c1968f7` (#2618 )	2024-02-20 17:42:31 -05:00
Taras Tsugrii	01ff2e14db	[nit] Remove unused msg local var. (#2511 )	2024-02-20 14:02:34 -05:00
Daniel Hiltgen	4fcbf1cde6	Merge pull request #2599 from dhiltgen/fix_avx Explicitly disable AVX2 on GPU builds	2024-02-19 13:13:05 -08:00
Daniel Hiltgen	9220b4fa91	Merge pull request #2585 from dhiltgen/cuda_leaks Fix cuda leaks	2024-02-19 12:48:00 -08:00
Daniel Hiltgen	fc39a6cd7a	Fix cuda leaks This should resolve the problem where we don't fully unload from the GPU when we go idle.	2024-02-18 18:37:20 -08:00
Daniel Hiltgen	df6dc4fd96	Fix duplicate menus on update and exit on signals Also fixes a few fit-and-finish items for better developer experience	2024-02-16 15:33:16 -08:00
Daniel Hiltgen	db2a9ad1fe	Explicitly disable AVX2 on GPU builds Even though we weren't setting it to on, somewhere in the cmake config it was getting toggled on. By explicitly setting it to off, we get `/arch:AVX` as intended.	2024-02-15 14:50:11 -08:00
Daniel Hiltgen	29e90cc13b	Implement new Go based Desktop app This focuses on Windows first, but could be used for Mac and possibly linux in the future.	2024-02-15 05:56:45 +00:00
Jeffrey Morgan	9241a29336	Revert "Revert "bump submodule to `6c00a06` (#2479 )"" (#2485 ) This reverts commit `6920964b87`.	2024-02-13 18:18:41 -08:00
Jeffrey Morgan	f7231ad9ad	set `shutting_down` to `false` once shutdown is complete (#2484 )	2024-02-13 17:48:41 -08:00
Jeffrey Morgan	6920964b87	Revert "bump submodule to `6c00a06` (#2479 )" This reverts commit `2f9ed52bbd`.	2024-02-13 17:23:05 -08:00
Jeffrey Morgan	2f9ed52bbd	bump submodule to `6c00a06` (#2479 )	2024-02-13 17:12:42 -08:00
Daniel Hiltgen	939c60473f	Merge pull request #2422 from dhiltgen/better_kill More robust shutdown	2024-02-12 14:05:06 -08:00
Jeffrey Morgan	f76ca04f9e	update submodule to `099afc6` (#2468 )	2024-02-12 14:01:16 -08:00
Daniel Hiltgen	76b8728f0c	Merge pull request #2465 from dhiltgen/block_rocm_pre_9 Detect AMD GPU info via sysfs and block old cards	2024-02-12 12:41:43 -08:00
Daniel Hiltgen	6d84f07505	Detect AMD GPU info via sysfs and block old cards This wires up some new logic to start using sysfs to discover AMD GPU information and detects old cards we can't yet support so we can fallback to CPU mode.	2024-02-12 08:19:41 -08:00
Jeffrey Morgan	26b13fc33c	patch: always add token to cache_tokens (#2459 )	2024-02-12 08:10:16 -08:00
Daniel Hiltgen	6680761596	Shutdown faster Make sure that when a shutdown signal comes, we shutdown quickly instead of waiting for a potentially long exchange to wrap up.	2024-02-08 22:22:50 -08:00
Daniel Hiltgen	a1dfab43b9	Ensure the libraries are present When we store our libraries in a temp dir, a reaper might clean them when we are idle, so make sure to check for them before we reload.	2024-02-07 17:27:49 -08:00
Daniel Hiltgen	de76b95dd4	Bump llama.cpp to b2081	2024-02-06 12:06:43 -08:00
Daniel Hiltgen	27aa2d4a19	Merge pull request #1849 from mraiser/main Accommodate split cuda lib dir	2024-02-05 16:01:16 -08:00
Daniel Hiltgen	e1f50377f4	Harden generate patching model Only apply patches if we have any, and make sure to cleanup every file we patched at the end to leave the tree clean	2024-02-01 19:34:36 -08:00
Jeffrey Morgan	f11bf0740b	use `llm.ImageData`	2024-01-31 19:13:48 -08:00
Michael Yang	8450bf66e6	trim images	2024-01-31 19:13:47 -08:00
Daniel Hiltgen	72b12c3be7	Bump llama.cpp to b1999 This requires an upstream change to support graceful termination, carried as a patch.	2024-01-30 16:52:12 -08:00
Jeffrey Morgan	2e06ed01d5	remove unknown `CPPFLAGS` option	2024-01-28 17:51:23 -08:00
mraiser	4c4c730a0a	Merge branch 'ollama:main' into main	2024-01-27 21:56:11 -05:00
Daniel Hiltgen	e02ecfb6c8	Merge pull request #2116 from dhiltgen/cc_50_80 Add support for CUDA 5.0 cards	2024-01-27 10:28:38 -08:00
Jeffrey Morgan	3ebd6a83fc	update submodule to `cd4fddb29f81d6a1f6d51a0c016bc6b486d68def`	2024-01-25 13:54:11 -08:00
Jeffrey Morgan	a64570dcae	Fix clearing kv cache between requests with the same prompt (#2186 ) * Fix clearing kv cache between requests with the same prompt * fix powershell script	2024-01-25 13:46:20 -08:00
mraiser	a4564232a4	Update gen_linux.sh to find libcudart in separate directory	2024-01-25 09:49:35 -05:00
Michael Yang	cd22855ef8	refactor tensor read	2024-01-24 10:48:31 -08:00
Jeffrey Morgan	4458efb73a	Load all layers on `arm64` macOS if model is small enough (#2149 )	2024-01-22 17:40:06 -08:00
Daniel Hiltgen	0f5b843319	Refine Accelerate usage on mac For old macs, accelerate seems to cause crashes, but for AVX2 capable macs, it does not.	2024-01-22 16:25:56 -08:00
Jeffrey Morgan	ffaf52e1e9	update submodule to `011e8ec577fd135cbc02993d3ea9840c516d6a1c`	2024-01-22 15:16:54 -08:00
Daniel Hiltgen	3bc28736cd	Merge pull request #2143 from dhiltgen/llm_verbosity Refine debug logging for llm	2024-01-22 13:19:16 -08:00
Daniel Hiltgen	730dcfcc7a	Refine debug logging for llm This wires up logging in llama.cpp to always go to stderr, and also turns up logging if OLLAMA_DEBUG is set.	2024-01-22 12:26:49 -08:00
Daniel Hiltgen	27a2d5af54	Debug logging on init failure	2024-01-22 12:08:22 -08:00
Jeffrey Morgan	5f81a33f43	update submodule to `6f9939d` (#2115 )	2024-01-22 11:56:40 -08:00
Daniel Hiltgen	5576bb2348	Merge pull request #2130 from dhiltgen/more_faster Make CPU builds parallel and customizable AMD GPUs	2024-01-21 16:14:12 -08:00
Daniel Hiltgen	ec3764538d	Probe GPUs before backend init Detect potential error scenarios so we can fallback to CPU mode without hitting asserts.	2024-01-21 15:59:38 -08:00
Daniel Hiltgen	df54c723ae	Make CPU builds parallel and customizable AMD GPUs The linux build now supports parallel CPU builds to speed things up. This also exposes AMD GPU targets as an optional setting for advanced users who want to alter our default set.	2024-01-21 15:12:21 -08:00
Jeffrey Morgan	89c4aee29e	Unlock mutex when failing to load model (#2117 )	2024-01-20 20:54:46 -05:00
Daniel Hiltgen	a447a083f2	Add compute capability 5.0, 7.5, and 8.0	2024-01-20 14:24:05 -08:00
Daniel Hiltgen	681a914990	Add support for CUDA 5.2 cards	2024-01-20 10:48:43 -08:00
Jeffrey Morgan	4c54f0ddeb	sign dylibs on macOS (#2101 )	2024-01-19 19:24:11 -05:00
Daniel Hiltgen	6a042438af	Switch to local dlopen symbols	2024-01-19 11:37:02 -08:00
Jeffrey Morgan	dc88cc3981	use `gzip` for runner embedding (#2067 )	2024-01-19 13:23:03 -05:00
Daniel Hiltgen	abec7f06e5	Merge pull request #2056 from dhiltgen/slog Mechanical switch from log to slog	2024-01-18 14:27:24 -08:00
Daniel Hiltgen	fedd705aea	Mechanical switch from log to slog A few obvious levels were adjusted, but generally everything mapped to "info" level.	2024-01-18 14:12:57 -08:00
Daniel Hiltgen	fccdf4c635	Merge pull request #1987 from xyproto/archlinux Let gpu.go and gen_linux.sh also find CUDA on Arch Linux	2024-01-18 13:32:10 -08:00
Daniel Hiltgen	1b249748ab	Add multiple CPU variants for Intel Mac This also refines the build process for the ext_server build.	2024-01-17 15:08:54 -08:00
Alexander F. Rødseth	cbe2adc78a	Merge branch 'main' into archlinux	2024-01-17 12:50:11 +01:00
Daniel Hiltgen	795674dd90	Bump llama.cpp to b1842 and add new cuda lib dep Upstream llama.cpp has added a new dependency with the NVIDIA CUDA Driver Libraries (libcuda.so) which is part of the driver distribution, not the general cuda libraries, and is not available as an archive, so we can not statically link it. This may introduce some additional compatibility challenges which we'll need to keep an eye on.	2024-01-16 12:53:52 -08:00
Bruce MacDonald	a897e833b8	do not cache prompt (#2018 ) - prompt cache causes inference to hang after some time	2024-01-16 13:48:05 -05:00
Daniel Hiltgen	8795447dad	Merge pull request #1966 from fpreiss/fpreiss/gen_linux_cuda_detection improve cuda detection (rel. issue #1704)	2024-01-14 18:00:11 -08:00
Daniel Hiltgen	95ad9a9fc8	Merge pull request #1988 from dhiltgen/fix_intel_mac Fix typo in arm mac arch script	2024-01-14 08:45:18 -08:00
Daniel Hiltgen	3ca5f69ce8	Fix typo in arm mac arch script	2024-01-14 08:32:57 -08:00
Daniel Hiltgen	cfa6337960	Merge pull request #1982 from dhiltgen/fix_intel_mac Fix intel mac build	2024-01-14 08:26:46 -08:00
Alexander F. Rødseth	f4bf1d514f	Let gpu.go and gen_linux.sh also find CUDA on Arch Linux	2024-01-14 13:40:36 +01:00
Jeffrey Morgan	557110d0ba	Disable `mmap` with lora layers (#1985 )	2024-01-13 23:36:31 -05:00
Daniel Hiltgen	2ecb247276	Fix intel mac build Make sure we're building an x86 ext_server lib when cross-compiling	2024-01-13 14:46:34 -08:00
Jeffrey Morgan	288ef8ff95	add `gcc -lstdc++` flag for linux cpu (#1974 )	2024-01-13 03:53:00 -05:00
Jeffrey Morgan	4cf17990f7	use g++ to build `libext_server.so` on linux (#1972 )	2024-01-13 03:12:42 -05:00
Michael Yang	eaed6f8c45	add max context length check	2024-01-12 14:54:07 -08:00
Fabian Preiss	905862e17b	improve cuda detection (rel. issue #1704 )	2024-01-12 21:59:19 +01:00
Daniel Hiltgen	3773fb6465	Merge pull request #1935 from dhiltgen/cpu_fallback Fix up the CPU fallback selection	2024-01-11 15:52:32 -08:00
Daniel Hiltgen	7427fa1387	Fix up the CPU fallback selection The memory changes and multi-variant change had some merge glitches I missed. This fixes them so we actually get the cpu llm lib and best variant for the given system.	2024-01-11 15:27:06 -08:00
Michael Yang	d2be6387c9	fix typo	2024-01-11 14:25:21 -08:00
Michael Yang	d7af35d3d0	import fmt	2024-01-11 14:22:32 -08:00
Michael Yang	defc1dbd6e	use x/exp/slices	2024-01-11 14:20:13 -08:00
Daniel Hiltgen	de2fbdec99	Merge pull request #1819 from dhiltgen/multi_variant Support multiple LLM libs; ROCm v5 and v6; Rosetta, AVX, and AVX2 compatible CPU builds	2024-01-11 14:00:48 -08:00
Michael Yang	f4f939de28	Merge pull request #1552 from jmorganca/mxyng/lint-test add lint and test on pull_request	2024-01-11 09:37:45 -08:00
Daniel Hiltgen	39928a42e8	Always dynamically load the llm server library This switches darwin to dynamic loading, and refactors the code now that no static linking of the library is used on any platform	2024-01-11 08:42:47 -08:00
Daniel Hiltgen	d88c527be3	Build multiple CPU variants and pick the best This reduces the built-in linux version to not use any vector extensions which enables the resulting builds to run under Rosetta on MacOS in Docker. Then at runtime it checks for the actual CPU vector extensions and loads the best CPU library available	2024-01-11 08:42:47 -08:00
Jeffrey Morgan	ab6be852c7	revisit memory allocation to account for full kv cache on main gpu	2024-01-11 01:45:31 -05:00
Daniel Hiltgen	8da7bef05f	Support multiple variants for a given llm lib type In some cases we may want multiple variants for a given GPU type or CPU. This adds logic to have an optional Variant which we can use to select an optimal library, but also allows us to try multiple variants in case some fail to load. This can be useful for scenarios such as ROCm v5 vs v6 incompatibility or potentially CPU features.	2024-01-10 17:27:51 -08:00
Jeffrey Morgan	b24e8d17b2	Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896 ) * increase minimum cuda overhead and fix minimum overhead for multi-gpu * fix multi gpu overhead * limit overhead to 10% of all gpus * better wording * allocate fixed amount before layers * fixed only includes graph alloc	2024-01-10 19:08:51 -05:00
Jeffrey Morgan	f83881390f	revert submodule back to `328b83de23b33240e28f4e74900d1d06726f5eb1`	2024-01-10 18:42:39 -05:00
Jeffrey Morgan	224fbf2795	update submodule to commit `1fc2f265ff9377a37fd2c61eae9cd813a3491bea` until its main branch is fixed	2024-01-10 17:03:15 -05:00
Jeffrey Morgan	2c6e8f5248	Update submodule to `6efb8eb30e7025b168f3fda3ff83b9b386428ad6` (#1885 ) * update submodule to `6efb8eb30e7025b168f3fda3ff83b9b386428ad6` * unblock condition variable in `update_slots` when closing server	2024-01-10 16:48:38 -05:00
Jeffrey Morgan	34344d801c	clean up cmake `build` directory when cross compiling macOS builds	2024-01-09 17:13:56 -05:00
Jeffrey Morgan	8a8c7e7f8d	only build for metal on `arm64`	2024-01-09 13:51:08 -05:00
Michael Yang	f921e2696e	typo	2024-01-09 09:45:42 -08:00
Michael Yang	4a33cede20	remove unused fields and functions	2024-01-09 09:37:40 -08:00
Michael Yang	2bb2bdd5d4	fix lint	2024-01-09 09:36:58 -08:00
Jeffrey Morgan	f387e9631b	use runner if cuda alloc won't fit	2024-01-09 00:44:34 -05:00
Jeffrey Morgan	cb534e6ac2	use 10% vram overhead for cuda	2024-01-08 23:17:44 -05:00
Jeffrey Morgan	58ce2d8273	better estimate scratch buffer size	2024-01-08 21:32:44 -05:00
Jeffrey Morgan	18ddf6d57d	fix windows build	2024-01-08 20:04:01 -05:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850 ) * select layers based on estimated model memory usage * always account for scratch vram * dont load +1 layers * better estimation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Jeffrey Morgan	5feec959ad	dont use `-Wall` in static build (#1833 )	2024-01-07 10:39:19 -05:00
Jeffrey Morgan	dbdd50b283	add `-DCMAKE_SYSTEM_NAME=Darwin` cmake flag (#1832 )	2024-01-07 00:46:17 -05:00
Bruce MacDonald	3367b5f3df	remove unused generate patches (#1810 )	2024-01-05 11:25:45 -05:00
Daniel Hiltgen	9983fa5f4e	Clean up stale submodule If the tree has a stale submodule, make sure we clean it up first	2024-01-04 13:40:16 -08:00
Daniel Hiltgen	fac9060da5	Init submodule with new path	2024-01-04 13:00:13 -08:00
Daniel Hiltgen	77d96da94b	Code shuffle to clean up the llm dir	2024-01-04 12:12:05 -08:00
Daniel Hiltgen	e9ce91e9a6	Load dynamic cpu lib on windows On linux, we link the CPU library in to the Go app and fall back to it when no GPU match is found. On windows we do not link in the CPU library so that we can better control our dependencies for the CLI. This fixes the logic so we correctly fallback to the dynamic CPU library on windows.	2024-01-04 08:41:41 -08:00
Jeffrey Morgan	c0285158a9	tweak memory requirements error text	2024-01-03 19:47:18 -05:00
Jeffrey Morgan	77a66df72c	add macOS memory check for 47B models	2024-01-03 19:46:16 -05:00
Jeffrey Morgan	5b4837f881	remove unused filetype check	2024-01-03 19:45:39 -05:00
Jeffrey Morgan	29340c2e62	update cmake flags for `amd64` macOS (#1780 ) * update cmake flags for intel macOS * remove `LLAMA_K_QUANTS` * put back `CMAKE_OSX_DEPLOYMENT_TARGET` and disable `LLAMA_F16C`	2024-01-03 19:22:15 -05:00
Daniel Hiltgen	d5ec730354	Merge pull request #1779 from dhiltgen/refined_amd_gpu_list Improve maintainability of Radeon card list	2024-01-03 16:18:57 -08:00
Daniel Hiltgen	ddbfa6fe31	Fix CPU only builds Go embed doesn't like when there's no matching files, so put a dummy placeholder in to allow building without any GPU support If no "server" library is found, it's safely ignored at runtime.	2024-01-03 16:08:34 -08:00
Daniel Hiltgen	16f4603b67	Improve maintainability of Radeon card list This moves the list of AMD GPUs to an easier to maintain list which should make it easier to update over time.	2024-01-03 15:16:56 -08:00
Bruce MacDonald	0b3118e0af	fix: relay request opts to loaded llm prediction (#1761 )	2024-01-03 12:01:42 -05:00
Daniel Hiltgen	0498f7ce56	Get rid of one-line llama.log This one log line was triggering a single line llama.log to be generated in the pwd of the server	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	738a8d12eb	Rename the ollama cmakefile	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	d966b730ac	Switch windows build to fully dynamic Refactor where we store build outputs, and support a fully dynamic loading model on windows so the base executable has no special dependencies thus doesn't require a special PATH.	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	9a70aecccb	Refactor how we augment llama.cpp This changes the model for llama.cpp inclusion so we're not applying a patch, but instead have the C++ code directly in the ollama tree, which should make it easier to refine and update over time.	2024-01-02 15:35:55 -08:00
Jeffrey Morgan	d4ebdadbe7	enable `cache_prompt` by default	2023-12-27 14:23:42 -05:00
K0IN	10da41d677	Add Cache flag to api (#1642 )	2023-12-22 17:16:20 -05:00
Daniel Hiltgen	e5202eb687	Quiet down llama.cpp logging by default By default builds will now produce non-debug and non-verbose binaries. To enable verbose logs in llama.cpp and debug symbols in the native code, set `CGO_CFLAGS=-g`	2023-12-22 08:47:18 -08:00
Daniel Hiltgen	fa24e73b82	Remove CPU build, fixup linux build script	2023-12-21 18:21:31 -08:00
Daniel Hiltgen	325d74985b	Fix CPU performance on hyperthreaded systems The default thread count logic was broken and resulted in 2x the number of threads as it should on a hyperthreading CPU resulting in thrashing and poor performance.	2023-12-21 16:23:36 -08:00
Daniel Hiltgen	d9cd3d9667	Revive windows build The windows native setup still needs some more work, but this gets it building again and if you set the PATH properly, you can run the resulting exe on a cuda system.	2023-12-20 17:21:54 -08:00
Daniel Hiltgen	7555ea44f8	Revamp the dynamic library shim This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6 given 5.7 builds don't work on the latest ROCm library that just shipped.	2023-12-20 14:45:57 -08:00
Daniel Hiltgen	6558f94ed0	Fix darwin intel build	2023-12-19 13:32:24 -08:00
Daniel Hiltgen	54dbfa4c4a	Carry ggml-metal.metal as payload	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	3269535a4c	Refine handling of shim presence This allows the CPU only builds to work on systems with Radeon cards	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	1b991d0ba9	Refine build to support CPU only If someone checks out the ollama repo and doesn't install the CUDA library, this will ensure they can build a CPU only version	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	9adca7f711	Bump llama.cpp to b1662 and set n_parallel=1	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	89bbaafa64	Build linux using ubuntu 20.04 This changes the container-based linux build to use an older Ubuntu distro to improve our compatibility matrix for older user machines	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00
65a	f8ef4439e9	Use build tags to generate accelerated binaries for CUDA and ROCm on Linux. The build tags rocm or cuda must be specified to both go generate and go build. ROCm builds should have both ROCM_PATH set (and the ROCM SDK present) as well as CLBlast installed (for GGML) and CLBlast_DIR set in the environment to the CLBlast cmake directory (likely /usr/lib/cmake/CLBlast). Build tags are also used to switch VRAM detection between cuda and rocm implementations, using added "accelerator_foo.go" files which contain architecture specific functions and variables. accelerator_none is used when no tags are set, and a helper function addRunner will ignore it if it is the chosen accelerator. Fix go generate commands, thanks @deadmeu for testing.	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	d4cd695759	Add cgo implementation for llama.cpp Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.	2023-12-19 09:05:46 -08:00
Bruce MacDonald	811b1f03c8	deprecate ggml - remove ggml runner - automatically pull gguf models when ggml detected - tell users to update to gguf in the case automatic pull fails Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>	2023-12-19 09:05:46 -08:00
Jeffrey Morgan	6b5bdfa6c9	update runner submodule	2023-12-18 17:33:46 -05:00
Jeffrey Morgan	c063ee4af0	update runner submodule to fix hipblas build	2023-12-18 15:41:13 -05:00
Jeffrey Morgan	b85982eb91	update runner submodule	2023-12-18 12:43:31 -05:00
Bruce MacDonald	6ee8c80199	restore model load duration on generate response (#1524 ) * restore model load duration on generate response - set model load duration on generate and chat done response - calculate createAt time when response created * remove checkpoints predict opts * Update routes.go	2023-12-14 12:15:50 -05:00
Jeffrey Morgan	31f0551dab	Update runner to support mixtral and mixture of experts (MoE) (#1475 )	2023-12-13 17:15:10 -05:00
Michael Yang	4251b342de	Merge pull request #1469 from jmorganca/mxyng/model-types remove per-model types	2023-12-12 12:27:03 -08:00
Bruce MacDonald	3144e2a439	exponential back-off (#1484 )	2023-12-12 12:33:02 -05:00
Bruce MacDonald	c0960e29b5	retry on concurrent request failure (#1483 ) - remove parallel	2023-12-12 12:14:35 -05:00
Patrick Devine	910e9401d0	Multimodal support (#1216 ) --------- Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>	2023-12-11 13:56:22 -08:00
Michael Yang	56ffc3023a	remove per-model types mostly replaced by decoding tensors except ggml models which only support llama	2023-12-11 09:40:21 -08:00
Jeffrey Morgan	fa2f095bd9	fix model name returned by `/api/generate` being different than the model name provided	2023-12-10 11:42:15 -05:00
Jeffrey Morgan	d9a250e9b5	seek to end of file when decoding older model formats	2023-12-09 21:14:35 -05:00
Jeffrey Morgan	944519ed16	seek to eof for older model binaries	2023-12-09 20:48:57 -05:00
Jeffrey Morgan	2dd040d04c	do not use `--parallel 2` for old runners	2023-12-09 20:17:33 -05:00
Bruce MacDonald	bbe41ce41a	fix: parallel queueing race condition caused silent failure (#1445 ) * fix: queued request failures - increase parallel requests to 2 to complete queued request, queueing is managed in ollama * log steam errors	2023-12-09 14:14:02 -05:00
Michael Yang	f1b049fed8	Merge pull request #1377 from jmorganca/mxyng/qwen update for qwen	2023-12-06 12:31:51 -08:00
Michael Yang	b9495ea162	load projectors	2023-12-05 14:36:12 -08:00
Michael Yang	409bb9674e	Merge pull request #1308 from jmorganca/mxyng/split-from split from into one or more models	2023-12-05 14:33:03 -08:00
Michael Yang	d3479c07a1	Merge pull request #1250 from jmorganca/mxyng/create-layer refactor layer creation	2023-12-05 14:32:52 -08:00
Bruce MacDonald	195e3d9dbd	chat api endpoint (#1392 )	2023-12-05 14:57:33 -05:00
Jeffrey Morgan	00d06619a1	Revert "chat api (#991 )" while context variable is fixed This reverts commit `7a0899d62d`.	2023-12-04 21:16:27 -08:00
Michael Yang	5a5dca13b2	comments	2023-12-04 16:59:23 -08:00
Michael Yang	72e7a49aa9	seek instead of copyn	2023-12-04 16:59:23 -08:00
Michael Yang	2cb0fa7d40	split from into one or more models	2023-12-04 16:59:23 -08:00
Michael Yang	b2816bca67	unnecessary ReadSeeker for DecodeGGML	2023-12-04 16:59:23 -08:00
Bruce MacDonald	7a0899d62d	chat api (#991 ) - update chat docs - add messages chat endpoint - remove deprecated context and template generate parameters from docs - context and template are still supported for the time being and will continue to work as expected - add partial response to chat history	2023-12-04 18:01:06 -05:00
Michael Yang	6deebf2489	update for qwen	2023-12-04 11:38:05 -08:00
Jeffrey Morgan	16a9006306	add back `f16c` instructions on intel mac	2023-11-26 15:59:49 -05:00
Jeffrey Morgan	9e4a316405	update submodule commit	2023-11-26 14:52:00 -05:00
Jing Zhang	82b9b329ff	windows CUDA support (#1262 ) * Support cuda build in Windows * Enable dynamic NumGPU allocation for Windows	2023-11-24 17:16:36 -05:00
Jongwook Choi	12e8c12d2b	Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261 ) When CUDA peer access is enabled, multi-gpu inference will produce garbage output. This is a known bug of llama.cpp (or nvidia). Until the upstream bug is fixed, we can disable CUDA peer access temporarily to ensure correct output. See #961.	2023-11-24 14:05:57 -05:00
Jeffrey Morgan	d77dde126b	consistent cpu instructions on macos and linux	2023-11-22 16:26:46 -05:00
Michael Yang	199941cd15	fix: gguf int type	2023-11-22 11:40:30 -08:00
Michael Yang	a00fac4ec8	update llama.cpp	2023-11-21 09:50:02 -08:00
Jeffrey Morgan	a3fcecf943	only set `main_gpu` if value > 0 is provided	2023-11-20 19:54:04 -05:00
Michael Yang	19b7a4d715	recent llama.cpp update added kernels for fp32, q5_0, and q5_1	2023-11-20 13:44:31 -08:00
Purinda Gunasekara	be61a81758	main-gpu argument is not getting passed to llamacpp, fixed. (#1192 )	2023-11-20 10:52:52 -05:00
Jeffrey Morgan	13ba6df5ab	enable cpu instructions on intel macs	2023-11-19 23:20:26 -05:00
Jeffrey Morgan	36a3bbf65f	Update llm/llama.go	2023-11-18 21:25:07 -05:00
Bruce MacDonald	43a726149d	fix potentially inaccurate error message	2023-11-18 21:25:07 -05:00
Jeffrey Morgan	41434a7cdc	build intel mac with correct binary and compile flags	2023-11-16 22:14:51 -05:00
Jeffrey Morgan	5cba29b9d6	JSON mode: add `"format" as an api parameter (#1051 ) * add `"format": "json"` as an API parameter --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-11-09 16:44:02 -08:00
Bruce MacDonald	1ae84bc2a2	skip gpu if less than 2GB VRAM are available (#1059 )	2023-11-09 13:16:16 -08:00
Michael Yang	c5e1bbabda	instead of static number of parameters for each model family, get the real number from the tensors (#1022 ) * parse tensor info * refactor decoder * return actual parameter count * explicit rounding * s/Human/HumanNumber/	2023-11-08 17:55:46 -08:00
Jeffrey Morgan	c44b619428	remove unused `fmt.Println`	2023-11-03 17:24:58 -07:00
Jeffrey Morgan	17678b7225	Restore system prompt on requests and default `num_keep` to `0`	2023-11-03 13:25:25 -07:00
Jeffrey Morgan	2e53704685	default rope params to 0 for new models (#968 )	2023-11-02 08:41:30 -07:00
Michael Yang	642128b75a	append LD_LIBRARY_PATH	2023-10-31 15:54:49 -07:00
Jeffrey Morgan	3a1ed9ff70	restore building runner with `AVX` on by default (#900 )	2023-10-27 12:13:44 -07:00
Bruce MacDonald	6d283882b1	catch insufficient permissions nvidia err (#934 )	2023-10-27 12:42:40 -04:00
Bruce MacDonald	2665f3c28e	offload 75% of available vram to improve stability (#921 )	2023-10-26 20:49:55 -04:00
Jeffrey Morgan	b0c9cd0f3b	fix metal assertion errors	2023-10-24 00:32:36 -07:00
Jeffrey Morgan	77f61c6301	update submodule commit	2023-10-24 00:30:27 -07:00
Jeffrey Morgan	f3604534e5	update submodule commit	2023-10-23 23:59:12 -07:00
Michael Yang	0c7a00a264	bump submodules pin to 9e70cc03229df19ca2d28ce23cc817198f897278 for now since 438c2ca83045a00ef244093d27e9ed41a8cb4ea9 is breaking	2023-10-23 11:17:59 -07:00
Michael Yang	36c160f1c3	Merge pull request #881 from jmorganca/mxyng/ggufv3 ggufv3	2023-10-23 10:50:45 -07:00
Michael Yang	c9167494cb	update default log target	2023-10-23 10:44:50 -07:00
Michael Yang	125d0a013a	ggufv3 ggufv3 adds support for big endianness, mainly for s390x architecture. while that's not currently supported for ollama, the change is simple. loosen version check to be more forward compatible. unless specified, gguf versions other v1 will be decoded into v2.	2023-10-23 09:35:49 -07:00
Jeffrey Morgan	7ed5a39bc7	simpler check for model loading compatibility errors	2023-10-19 14:50:49 -04:00
Jeffrey Morgan	a7dad24d92	add error for `falcon` and `starcoder` vocab compatibility (#844 ) add error for falcon and starcoder vocab compatibility --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-10-19 12:18:31 -04:00
Michael Yang	235e43d7f6	Merge pull request #833 from discovertomorrow/leadingspace Fix Issue with Leading Whitespaces in Decoded Context	2023-10-18 13:52:48 -07:00
Arne Müller	730996e530	use TrimPrefix instead of TrimLeft	2023-10-18 22:51:30 +02:00
Arne Müller	ce6197a8e0	removed redundant strings.CutPrefix from Decode	2023-10-18 22:47:20 +02:00
Arne Müller	46b9953f32	use strings.TrimLeft to remove spaces	2023-10-18 22:41:19 +02:00
Bruce MacDonald	565648f3f7	relay CUDA errors to the client (#825 )	2023-10-18 15:36:56 -04:00
Arne Müller	90c49bed57	moved removal of leading space into Predict	2023-10-18 20:08:26 +02:00
Arne Müller	5dc0cff459	fix whitespace removal	2023-10-18 08:15:27 +02:00
Michael Yang	08b0e04f40	Merge pull request #813 from jmorganca/mxyng/llama refactor llm/llama.go	2023-10-17 14:05:58 -07:00
Michael Yang	b36b0b71f8	use cut prefix	2023-10-17 14:01:39 -07:00
Michael Yang	094df37563	remove unused struct	2023-10-17 14:01:38 -07:00
Bruce MacDonald	f3648fd206	Update llama.cpp gguf to latest (#710 )	2023-10-17 16:55:16 -04:00
Bruce MacDonald	bd93a94abd	fix MB VRAM log output (#824 )	2023-10-17 15:35:16 -04:00
Michael Yang	f55bdb6f10	Merge pull request #799 from deichbewohner/jsonmarshaling Fix JSON Marshal Escaping for Special Characters	2023-10-17 08:46:02 -07:00
Michael Yang	2870a9bfc8	Merge pull request #812 from jmorganca/mxyng/fix-format-string fix: wrong format string type	2023-10-17 08:40:49 -07:00
Arne Müller	8fa3f366ad	Removed newline trimming and used buffer directly in POST request.	2023-10-17 08:17:35 +02:00
Michael Yang	fddb303f23	fix: format string wrong type	2023-10-16 16:14:28 -07:00
Michael Yang	cb4a80b693	fix: regression unsupported metal types omitting `--n-gpu-layers` means use metal on macos which isn't correct since ollama uses `num_gpu=0` to explicitly disable gpu for file types that are not implemented in metal	2023-10-16 14:37:20 -07:00
Arne Müller	ee94693b1a	handling unescaped json marshaling	2023-10-16 11:15:55 +02:00
Michael Yang	11d82d7b9b	update checkvram	2023-10-13 14:47:29 -07:00
Michael Yang	36fe2deebf	only check system memory on macos	2023-10-13 14:47:29 -07:00
Michael Yang	4a8931f634	check total (system + video) memory	2023-10-13 14:47:29 -07:00
Michael Yang	bd6e38fb1a	refactor memory check	2023-10-13 14:47:29 -07:00
Michael Yang	92189a5855	fix memory check	2023-10-13 14:47:29 -07:00
Michael Yang	d790bf9916	Merge pull request #783 from jmorganca/mxyng/fix-gpu-offloading fix: offloading on low end GPUs	2023-10-13 14:36:44 -07:00
Michael Yang	35afac099a	do not use gpu binary when num_gpu == 0	2023-10-13 14:32:12 -07:00
Michael Yang	811c3d1900	no gpu if vram < 2GB	2023-10-13 14:32:12 -07:00
Bruce MacDonald	6fe178134d	improve api error handling (#781 ) - remove new lines from llama.cpp error messages relayed to client - check api option types and return error on wrong type - change num layers from 95% VRAM to 92% VRAM	2023-10-13 16:57:10 -04:00
Bruce MacDonald	56497663c8	relay model runner error message to client (#720 ) * give direction to user when runner fails * also relay errors from timeout * increase timeout to 3 minutes	2023-10-12 11:16:37 -04:00
Michael Yang	b599946b74	add format bytes	2023-10-11 14:08:23 -07:00
Bruce MacDonald	77295f716e	prevent waiting on exited command (#752 ) * prevent waiting on exited command * close llama runner once	2023-10-11 12:32:13 -04:00
Bruce MacDonald	f2ba1311aa	improve vram safety with 5% vram memory buffer (#724 ) * check free memory not total * wait for subprocess to exit	2023-10-10 16:16:09 -04:00
Jeffrey Morgan	ab0668293c	llm: fix build on `amd64`	2023-10-06 14:39:54 -07:00
Bruce MacDonald	5d22319a2c	rename server subprocess (#700 ) - this makes it easier to see that the subprocess is associated with ollama	2023-10-06 10:15:42 -04:00
Bruce MacDonald	d06bc0cb6e	enable q8, q5, 5_1, and f32 for linux gpu (#699 )	2023-10-05 12:53:47 -04:00
Bruce MacDonald	9e2de1bd2c	increase streaming buffer size (#692 )	2023-10-04 14:09:00 -04:00
Michael Yang	c02c0cd483	starcoder	2023-10-02 19:56:51 -07:00
Bruce MacDonald	b1f7123301	clean up num_gpu calculation code (#673 )	2023-10-02 14:53:42 -04:00
Bruce MacDonald	1fbf3585d6	Relay default values to llama runner (#672 ) * include seed in params for llama.cpp server and remove empty filter for temp * relay default predict options to llama.cpp - reorganize options to match predict request for readability * omit empty stop --------- Co-authored-by: hallh <hallh@users.noreply.github.com>	2023-10-02 14:53:16 -04:00
Bruce MacDonald	9771b1ec51	windows runner fixes (#637 )	2023-09-29 11:47:55 -04:00
Michael Yang	f40b3de758	use int64 consistently	2023-09-28 11:07:24 -07:00
Bruce MacDonald	86279f4ae3	unbound max num gpu layers (#591 ) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-25 18:36:46 -04:00
Michael Yang	058d0cd04b	silence warm up log	2023-09-21 14:53:33 -07:00
Michael Yang	ee1c994d15	update submodule (#567 )	2023-09-21 16:22:23 -04:00
Bruce MacDonald	4cba75efc5	remove tmp directories created by previous servers (#559 ) * remove tmp directories created by previous servers * clean up on server stop * Update routes.go * Update server/routes.go Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> * create top-level temp ollama dir * check file exists before creating --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com> Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-21 20:38:49 +01:00
Michael Yang	a9ed7cc6aa	rename generate.go	2023-09-20 14:42:17 -07:00
Michael Yang	6c6a31a1e8	embed libraries using cmake	2023-09-20 14:41:57 -07:00
Bruce MacDonald	fc6ec356fc	remove libcuda.so	2023-09-20 20:36:14 +01:00
Bruce MacDonald	1255bc9b45	only package 11.8 runner	2023-09-20 20:00:41 +01:00
Bruce MacDonald	b9bb5ca288	use cuda_version	2023-09-20 17:58:16 +01:00
Bruce MacDonald	4e8be787c7	pack in cuda libs	2023-09-20 17:40:42 +01:00
Bruce MacDonald	66003e1d05	subprocess improvements (#524 ) * subprocess improvements - increase start-up timeout - when runner fails to start fail rather than timing out - try runners in order rather than choosing 1 runner - embed metal runner in metal dir rather than gpu - refactor logging and error messages * Update llama.go * Update llama.go * simplify by using glob	2023-09-18 15:16:32 -04:00
Bruce MacDonald	2540c9181c	support for packaging in multiple cuda runners (#509 ) * enable packaging multiple cuda versions * use nvcc cuda version if available --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-14 15:08:13 -04:00
Michael Yang	d028853879	fix: add falcon.go	2023-09-13 14:47:37 -07:00
Michael Yang	949553db23	Merge pull request #519 from jmorganca/mxyng/decode Mxyng/decode	2023-09-13 12:43:57 -07:00
Michael Yang	0c5a454361	fix model type for 70b	2023-09-12 15:12:59 -07:00
Bruce MacDonald	f59c4d03f7	fix ggml arm64 cuda build (#520 )	2023-09-12 17:06:48 -04:00
Michael Yang	7dee25a07f	fix falcon decode get model and file type from bin file	2023-09-12 12:34:53 -07:00
Bruce MacDonald	f221637053	first pass at linux gpu support (#454 ) * linux gpu support * handle multiple gpus * add cuda docker image (#488) --------- Co-authored-by: Michael Yang <mxyng@pm.me>	2023-09-12 11:04:35 -04:00
Bruce MacDonald	09dd2aeff9	GGUF support (#441 )	2023-09-07 13:55:37 -04:00
Jeffrey Morgan	61dda6a5e0	set minimum `CMAKE_OSX_DEPLOYMENT_TARGET` to 11.0	2023-09-06 19:56:50 -04:00
Jeffrey Morgan	7de300856b	use `osPath` in gpu check	2023-09-05 21:52:21 -04:00
Jeffrey Morgan	213ffdb548	macos `amd64` compatibility fixes	2023-09-05 21:33:31 -04:00
Bruce MacDonald	d18282bfda	metal: add missing barriers for mul-mat (#469 )	2023-09-05 19:37:13 -04:00
Michael Yang	2bc06565c7	fix empty response	2023-09-05 15:03:24 -07:00
Michael Yang	7b5aefb427	Merge pull request #462 from jmorganca/mxyng/rm-marshal-prompt remove marshalPrompt which is no longer needed	2023-09-05 11:48:41 -07:00
Jeffrey Morgan	7fa6e51686	generate binary dependencies based on GOARCH on macos (#459 )	2023-09-05 12:53:57 -04:00
Michael Yang	59a705525c	fix not forwarding last token	2023-09-03 17:46:50 -04:00
Michael Yang	5d3f314b0b	remove marshalPrompt which is no longer needed	2023-09-03 17:01:05 -04:00
Bruce MacDonald	f964aea9a2	remove test not applicate to subprocess	2023-08-30 16:36:11 -04:00
Bruce MacDonald	42998d797d	subprocess llama.cpp server (#401 ) * remove c code * pack llama.cpp * use request context for llama_cpp * let llama_cpp decide the number of threads to use * stop llama runner when app stops * remove sample count and duration metrics * use go generate to get libraries * tmp dir for running llm	2023-08-30 16:35:03 -04:00
Quinn Slack	f4432e1dba	treat stop as stop sequences, not exact tokens (#442 ) The `stop` option to the generate API is a list of sequences that should cause generation to stop. Although these are commonly called "stop tokens", they do not necessarily correspond to LLM tokens (per the LLM's tokenizer). For example, if the caller sends a generate request with `"stop":["\n"]`, then generation should stop on any token containing `\n` (and trim `\n` from the output), not just if the token exactly matches `\n`. If `stop` were interpreted strictly as LLM tokens, then it would require callers of the generate API to know the LLM's tokenizer and enumerate many tokens in the `stop` list. Fixes https://github.com/jmorganca/ollama/issues/295.	2023-08-30 11:53:42 -04:00
Michael Yang	7df342a6ea	Merge pull request #421 from jmorganca/mxyng/f16-metal allow F16 to use metal	2023-08-29 06:32:59 -07:00
Michael Yang	e82fcf30c6	Merge pull request #420 from jmorganca/mxyng/34b-mem-check add 34b to mem check	2023-08-26 14:15:52 -07:00
Michael Yang	b25dd1795d	allow F16 to use metal warning F16 uses significantly more memory than quantized model so the standard requires don't apply.	2023-08-26 08:38:48 -07:00
Michael Yang	304f2b6c96	add 34b to mem check	2023-08-26 08:29:21 -07:00
Jeffrey Morgan	177b69a211	add missing entries for 34B	2023-08-25 18:35:35 -07:00

665 commits