ollama

Author	SHA1	Message	Date
Daniel Hiltgen	283948c83b	Adjust windows ROCm discovery The v5 hip library returns unsupported GPUs which wont enumerate at inference time in the runner so this makes sure we align discovery. The gfx906 cards are no longer supported so we shouldn't compile with that GPU type as it wont enumerate at runtime.	2024-07-20 15:17:50 -07:00
Jeffrey Morgan	1475eab95f	add patch for tekken (#5807)	2024-07-20 13:41:21 -04:00
Michael Yang	4a565cbf94	add chat and generate tests with mock runner	2024-07-16 09:39:31 -07:00
royjhan	b9f5e16c80	Introduce `/api/embed` endpoint supporting batch embedding (#5127) * Initial Batch Embedding * Revert "Initial Batch Embedding" This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29. * Initial Draft * mock up notes * api/embed draft * add server function * check normalization * clean up * normalization * playing around with truncate stuff * Truncation * Truncation * move normalization to go * Integration Test Template * Truncation Integration Tests * Clean up * use float32 * move normalize * move normalize test * refactoring * integration float32 * input handling and handler testing * Refactoring of legacy and new * clear comments * merge conflicts * touches * embedding type 64 * merge conflicts * fix hanging on single string * refactoring * test values * set context length * clean up * testing clean up * testing clean up * remove function closure * Revert "remove function closure" This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787. * remove function closure * remove redundant error check * clean up * more clean up * clean up	2024-07-15 12:14:24 -07:00
Jeffrey Morgan	ef98803d63	llm: looser checks for minimum memory (#5677)	2024-07-13 09:20:05 -07:00
Josh	10e768826c	fix: quant err message (#5616)	2024-07-11 17:24:29 -07:00
Jeffrey Morgan	c4cf8ad559	llm: avoid loading model if system memory is too small (#5637) * llm: avoid loading model if system memory is too small * update log * Instrument swap free space On linux and windows, expose how much swap space is available so we can take that into consideration when scheduling models * use `systemSwapFreeMemory` in check --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2024-07-11 16:42:57 -07:00
Jeffrey Morgan	791650ddef	sched: only error when over-allocating system memory (#5626)	2024-07-11 00:53:12 -07:00
Jeffrey Morgan	efbf41ed81	llm: dont link cuda with compat libs (#5621)	2024-07-10 20:01:52 -07:00
Michael Yang	37a570f962	Merge pull request #5612 from ollama/mxyng/mem chatglm graph	2024-07-10 14:18:33 -07:00
Michael Yang	5a739ff4cb	chatglm graph	2024-07-10 13:43:47 -07:00
Jeffrey Morgan	4e262eb2a8	remove `GGML_CUDA_FORCE_MMQ=on` from build (#5588)	2024-07-10 13:17:13 -07:00
Daniel Hiltgen	b50c818623	Merge pull request #5607 from dhiltgen/win_rocm_v6 Bump ROCm on windows to 6.1.2	2024-07-10 12:47:10 -07:00
Daniel Hiltgen	1f50356e8e	Bump ROCm on windows to 6.1.2 This also adjusts our algorithm to favor our bundled ROCm. I've confirmed VRAM reporting still doesn't work properly so we can't yet enable concurrency by default.	2024-07-10 11:01:22 -07:00
Daniel Hiltgen	22c81f62ec	Remove duplicate merge glitch	2024-07-10 09:01:33 -07:00
Daniel Hiltgen	2d1e3c3229	Merge pull request #5503 from dhiltgen/dual_rocm Workaround broken ROCm p2p copy	2024-07-09 15:44:16 -07:00
Daniel Hiltgen	b51e3b63ac	Statically link c++ and thread lib This makes sure we statically link the c++ and thread library on windows to avoid unnecessary runtime dependencies on non-standard DLLs	2024-07-09 11:34:30 -07:00
Michael Yang	9bbddc37a7	Merge pull request #5126 from ollama/mxyng/messages update message processing	2024-07-09 09:20:44 -07:00
Daniel Hiltgen	0bacb30007	Workaround broken ROCm p2p copy Enable the build flag for llama.cpp to use CPU copy for multi-GPU scenarios.	2024-07-08 09:40:52 -07:00
Jeffrey Morgan	53da2c6965	llm: remove ambiguous comment when putting upper limit on predictions to avoid infinite generation (#5535)	2024-07-07 14:32:05 -04:00
Jeffrey Morgan	d8def1ff94	llm: allow gemma 2 to context shift (#5534)	2024-07-07 13:41:51 -04:00
Jeffrey Morgan	571dc61955	Update llama.cpp submodule to `a8db2a9c` (#5530)	2024-07-07 13:03:09 -04:00
Jeffrey Morgan	0e09c380fc	llm: print caching notices in debug only (#5533)	2024-07-07 12:38:04 -04:00
Jeffrey Morgan	4607c70641	llm: add `-DBUILD_SHARED_LIBS=off` to common cpu cmake flags (#5520)	2024-07-06 18:58:16 -04:00
jmorganca	a08f20d910	release: remove unwanted mingw dll.a files	2024-07-06 15:21:15 -04:00
jmorganca	6cea036027	Revert "llm: only statically link libstdc++" This reverts commit `5796bfc401`.	2024-07-06 15:10:48 -04:00
jmorganca	5796bfc401	llm: only statically link libstdc++	2024-07-06 14:06:20 -04:00
jmorganca	f1a379aa56	llm: statically link pthread and stdc++ dependencies in windows build	2024-07-06 12:54:02 -04:00
jmorganca	9ae146993e	llm: add `GGML_STATIC` flag to windows static lib	2024-07-06 03:27:05 -04:00
Jeffrey Morgan	e0348d3fe8	llm: add `COMMON_DARWIN_DEFS` to arm static build (#5513)	2024-07-05 22:42:42 -04:00
Jeffrey Morgan	2cc854f8cb	llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511) * Revert "fix cmake build (#5505)" This reverts commit `4fd5f3526a`. * llm: fix missing dylibs by restoring old build behavior * crlf -> lf	2024-07-05 21:48:31 -04:00
Jeffrey Morgan	5304b765b2	llm: put back old include dir (#5507) * llm: put back old include dir * llm: update link paths for old submodule commits	2024-07-05 19:34:21 -04:00
Jeffrey Morgan	4fd5f3526a	fix cmake build (#5505)	2024-07-05 19:07:01 -04:00
Michael Yang	ac7a842e55	fix model reloading ensure runtime model changes (template, system prompt, messages, options) are captured on model updates without needing to reload the server	2024-07-05 13:17:25 -07:00
Jeffrey Morgan	78fb33dd07	fix typo in cgo directives in `llm.go` (#5501)	2024-07-05 15:18:36 -04:00
Jeffrey Morgan	8f8e736b13	update llama.cpp submodule to `d7fd29f` (#5475)	2024-07-05 13:25:58 -04:00
Jeffrey Morgan	d89454de80	Use slot with cached prompt instead of least recently used (#5492) * Use common prefix to select slot * actually report `longest`	2024-07-05 12:32:47 -04:00
Jeffrey Morgan	e9188e971a	Fix assert on small embedding inputs (#5491) * Fix assert on small embedding inputs * Update llm/patches/09-pooling.diff	2024-07-05 11:20:57 -04:00
Daniel Hiltgen	02c24d3d01	Merge pull request #5466 from dhiltgen/fix_clip_unicode Fix clip model loading with unicode paths	2024-07-05 08:16:58 -07:00
Jeffrey Morgan	4d71c559b2	fix error detection by limiting model loading error parsing (#5472)	2024-07-03 20:04:30 -04:00
Daniel Hiltgen	ccd7785859	Merge pull request #5243 from dhiltgen/modelfile_use_mmap Fix use_mmap for modefiles	2024-07-03 13:59:42 -07:00
royjhan	3b5a4a77f3	Return Correct Prompt Eval Count Regardless of Cache Prompt (#5371) * openai compatibility * Revert "openai compatibility" This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c. * remove erroneous subtraction of prompt cache	2024-07-03 13:46:23 -07:00
Daniel Hiltgen	0e982bc1f4	Fix corner cases on tmp cleaner on mac When ollama is running a long time, tmp cleaners can remove the runners. This tightens up a few corner cases on arm macs where we failed with "server cpu not listed in available servers map[]"	2024-07-03 13:10:14 -07:00
Daniel Hiltgen	6298f49816	Fix clip model loading with unicode paths On windows, if the model dir contained unicode characters clip models would fail to load. This fixes the file name handling in clip.cpp to support utf16 on windows.	2024-07-03 12:46:36 -07:00
Josh Yan	33a65e3ba3	error	2024-07-01 16:04:13 -07:00
Daniel Hiltgen	97c9e11768	Switch use_mmap to a pointer type This uses nil as undefined for a cleaner implementation.	2024-07-01 08:44:59 -07:00
Daniel Hiltgen	3518aaef33	Merge pull request #4218 from dhiltgen/auto_parallel Enable concurrency by default	2024-07-01 08:32:29 -07:00
Jeffrey Morgan	717f7229eb	Do not shift context for sliding window models (#5368) * Do not shift context for sliding window models * truncate prompt > 2/3 tokens * only target gemma2	2024-06-28 19:39:31 -07:00
Michael Yang	de2163dafd	gemma2 graph	2024-06-27 13:34:52 -07:00
Jeffrey Morgan	4d311eb731	llm: architecture patch (#5316)	2024-06-26 21:38:12 -07:00
Blake Mizerany	cb42e607c5	llm: speed up gguf decoding by a lot (#5246) Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow: * Too many allocations when decoding strings * Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O. The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3. This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.	2024-06-24 21:47:52 -07:00
Daniel Hiltgen	17b7186cd7	Enable concurrency by default This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these by the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small VRAM GPUs so this change also refines the algorithm so that when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.	2024-06-21 15:45:05 -07:00
Daniel Hiltgen	5bf5aeec01	Refine mmap default logic on linux If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.	2024-06-20 11:07:04 -07:00
Michael Yang	8e0641a9bf	handle asymmetric embedding KVs	2024-06-20 09:57:27 -07:00
Michael Yang	9d91e5e587	remove confusing log message	2024-06-19 11:14:11 -07:00
Daniel Hiltgen	96624aa412	Merge pull request #5072 from dhiltgen/windows_path Move libraries out of users path	2024-06-19 09:13:39 -07:00
Michael Yang	e873841cbb	deepseek v2 graph	2024-06-18 15:35:12 -07:00
Daniel Hiltgen	359b15a597	Handle models with divergent layer sizes The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.	2024-06-18 11:05:34 -07:00
Daniel Hiltgen	7784ca33ce	Tighten up memory prediction logging Prior to this change, we logged the memory prediction multiple times as the scheduler iterates to find a suitable configuration, which can be confusing since only the last log before the server starts is actually valid. This now logs once just before starting the server on the final configuration. It also reports what library instead of always saying "offloading to gpu" when using CPU.	2024-06-18 09:15:35 -07:00
Daniel Hiltgen	171796791f	Adjust mmap logic for cuda windows for faster model load On Windows, recent llama.cpp changes make mmap slower in most cases, so default to off. This also implements a tri-state for use_mmap so we can detect the difference between a user provided value of true/false, or unspecified.	2024-06-17 16:54:30 -07:00
Daniel Hiltgen	b0930626c5	Add back lower level parallel flags nvcc supports parallelism (threads) and cmake + make can use -j, while msbuild requires /p:CL_MPcount=8	2024-06-17 13:44:46 -07:00
Daniel Hiltgen	e890be4814	Revert "More parallelism on windows generate" This reverts commit `0577af98f4`.	2024-06-17 13:32:46 -07:00
Daniel Hiltgen	b2799f111b	Move libraries out of users path We update the PATH on windows to get the CLI mapped, but this has an unintended side effect of causing other apps that may use our bundled DLLs to get terminated when we upgrade.	2024-06-17 13:12:18 -07:00
Jeffrey Morgan	152fc202f5	llm: update llama.cpp commit to `7c26775` (#4896) * llm: update llama.cpp submodule to `7c26775` * disable `LLAMA_BLAS` for now * `-DLLAMA_OPENMP=off`	2024-06-17 15:56:16 -04:00
Daniel Hiltgen	4b0050cf0e	Merge pull request #5037 from dhiltgen/faster_win_build More parallelism on windows generate	2024-06-15 08:03:05 -07:00
Daniel Hiltgen	0577af98f4	More parallelism on windows generate Make the build faster	2024-06-15 07:44:55 -07:00
Daniel Hiltgen	da3bf23354	Workaround gfx900 SDMA bugs Implement support for GPU env var workarounds, and leverage this for the Vega RX 56 which needs HSA_ENABLE_SDMA=0 set to work properly	2024-06-14 15:38:13 -07:00
Daniel Hiltgen	17df6520c8	Remove mmap related output calc logic	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	fc37c192ae	Refine CPU load behavior with system memory visibility	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	6fd04ca922	Improve multi-gpu handling at the limit Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	fb9cdfa723	Fix server.cpp for the new cuda build macros	2024-06-14 14:51:40 -07:00
Michael Yang	217f60c3d9	Merge pull request #4987 from ollama/mxyng/revert-byte-order Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order"	2024-06-11 16:04:20 -07:00
Michael Yang	7bdcd1da94	Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order" This reverts commit `f5f245cc15`, reversing changes made to `94d37fdcae`. this change broke gguf v2 which is incorrectly detected as big endian	2024-06-11 15:56:17 -07:00
Jeffrey Morgan	ead259d877	llm: fix seed value not being applied to requests (#4986)	2024-06-11 14:24:41 -07:00
Michael Yang	f5f245cc15	Merge pull request #4938 from ollama/mxyng/fix-byte-order fix parsing big endian gguf	2024-06-10 09:38:12 -07:00
Craig Hughes	b84aea1685	Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. (#3782)	2024-06-09 10:57:09 -07:00
Jeffrey Morgan	34f142797a	llm: always add bos token to prompt (#4941) * fix embedding by adding fixes from llama.cpp upstream * remove assert --------- Co-authored-by: Jesper Ek <deadbeef84@gmail.com>	2024-06-08 18:47:10 -07:00
Michael Yang	620d5c569e	fix parsing big endian gguf	2024-06-08 12:35:26 -07:00
Daniel Hiltgen	cddc63381c	Merge pull request #4909 from dhiltgen/oneapi_disable Add ability to skip oneapi generate	2024-06-07 14:07:15 -07:00
Michael Yang	030e765e76	fix create model when template detection errors	2024-06-07 10:51:35 -07:00
Daniel Hiltgen	ab8c929e20	Add ability to skip oneapi generate This follows the same pattern for cuda and rocm to allow disabling the build even when we detect the dependent libraries	2024-06-07 08:32:49 -07:00
Jeffrey Morgan	ce0dc33cb8	llm: patch to fix qwen 2 temporarily on nvidia (#4897)	2024-06-06 23:14:33 -07:00
Michael Yang	9b6c2e6eb6	detect chat template from KV	2024-06-06 16:03:47 -07:00
Michael Yang	6297f85606	gofmt, goimports	2024-06-04 13:20:24 -07:00
Michael Yang	e40145a39d	lint	2024-06-04 11:13:30 -07:00
Michael Yang	c895a7d13f	some gocritic	2024-06-04 11:13:30 -07:00
Michael Yang	04f3c12bb7	replace x/exp/slices with slices	2024-06-04 11:13:30 -07:00
Michael Yang	829ff87bd1	revert tokenize ffi (#4761) * Revert "use `int32_t` for call to tokenize (#4738)" This reverts commit `763bb65dbb`. * Revert "vocab only" This reverts commit `bf54c845e9`. * Revert "use ffi for tokenizing/detokenizing" This reverts commit `26a00a0410`.	2024-05-31 18:54:21 -07:00
Jeffrey Morgan	763bb65dbb	use `int32_t` for call to tokenize (#4738) * use `int32_t` for call to tokenize * variable naming * cleanup * fix crash	2024-05-30 21:43:30 -07:00
Jeffrey Morgan	7ca9605f54	speed up tests by only building static lib (#4740)	2024-05-30 21:43:15 -07:00
Michael Yang	eb2c443a79	Merge pull request #4736 from ollama/mxyng/vocab-only vocab only for tokenize	2024-05-30 17:21:00 -07:00
Jeffrey Morgan	a50a87a7b8	partial offloading: allow flash attention and disable mmap (#4734) * partial offloading: allow flash attention and disable mmap * allow mmap with num_gpu=0	2024-05-30 16:58:01 -07:00
Michael Yang	bf54c845e9	vocab only	2024-05-30 16:49:28 -07:00
Jeffrey Morgan	22f5c12ced	Update llama.cpp submodule to `5921b8f0` (#4731) * update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603` * add patch	2024-05-30 16:20:22 -07:00
Michael Yang	de781b37c8	rm unused infill	2024-05-29 11:26:47 -07:00
Michael Yang	3e21799377	rm unused system prompt	2024-05-29 11:26:47 -07:00
Michael Yang	26a00a0410	use ffi for tokenizing/detokenizing	2024-05-29 11:26:47 -07:00
Daniel Hiltgen	646371f56d	Merge pull request #3278 from zhewang1-intc/rebase_ollama_main Enabling ollama to run on Intel GPUs with SYCL backend	2024-05-28 16:30:50 -07:00
Daniel Hiltgen	92c81e8117	Give the final model loading more time On some systems, 1 minute isn't sufficient to finish the load after it hits 100% This creates 2 distinct timers, although they're both set to the same value for now so we can refine the timeouts further.	2024-05-28 09:08:10 -07:00

633 commits