ollama

Author	SHA1	Message	Date
Michael Yang	35b89b2eab	rfc: dynamic environ lookup	2024-07-22 11:25:30 -07:00
Daniel Hiltgen	a3c20e3f18	Refine error reporting for subprocess crash On windows, the exit status winds up being the search term many users search for and end up piling in on issues that are unrelated. This refines the reporting so that if we have a more detailed message we'll suppress the exit status portion of the message.	2024-07-22 08:52:16 -07:00
Daniel Hiltgen	283948c83b	Adjust windows ROCm discovery The v5 hip library returns unsupported GPUs which won't enumerate at inference time in the runner so this makes sure we align discovery. The gfx906 cards are no longer supported so we shouldn't compile with that GPU type as it won't enumerate at runtime.	2024-07-20 15:17:50 -07:00
royjhan	b9f5e16c80	Introduce `/api/embed` endpoint supporting batch embedding (#5127 ) * Initial Batch Embedding * Revert "Initial Batch Embedding" This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29. * Initial Draft * mock up notes * api/embed draft * add server function * check normalization * clean up * normalization * playing around with truncate stuff * Truncation * Truncation * move normalization to go * Integration Test Template * Truncation Integration Tests * Clean up * use float32 * move normalize * move normalize test * refactoring * integration float32 * input handling and handler testing * Refactoring of legacy and new * clear comments * merge conflicts * touches * embedding type 64 * merge conflicts * fix hanging on single string * refactoring * test values * set context length * clean up * testing clean up * testing clean up * remove function closure * Revert "remove function closure" This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787. * remove function closure * remove redundant error check * clean up * more clean up * clean up	2024-07-15 12:14:24 -07:00
Jeffrey Morgan	ef98803d63	llm: looser checks for minimum memory (#5677 )	2024-07-13 09:20:05 -07:00
Jeffrey Morgan	c4cf8ad559	llm: avoid loading model if system memory is too small (#5637 ) * llm: avoid loading model if system memory is too small * update log * Instrument swap free space On linux and windows, expose how much swap space is available so we can take that into consideration when scheduling models * use `systemSwapFreeMemory` in check --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2024-07-11 16:42:57 -07:00
Jeffrey Morgan	791650ddef	sched: only error when over-allocating system memory (#5626 )	2024-07-11 00:53:12 -07:00
Daniel Hiltgen	22c81f62ec	Remove duplicate merge glitch	2024-07-10 09:01:33 -07:00
Michael Yang	9bbddc37a7	Merge pull request #5126 from ollama/mxyng/messages update message processing	2024-07-09 09:20:44 -07:00
Jeffrey Morgan	53da2c6965	llm: remove ambiguous comment when putting upper limit on predictions to avoid infinite generation (#5535 )	2024-07-07 14:32:05 -04:00
Michael Yang	ac7a842e55	fix model reloading ensure runtime model changes (template, system prompt, messages, options) are captured on model updates without needing to reload the server	2024-07-05 13:17:25 -07:00
Daniel Hiltgen	ccd7785859	Merge pull request #5243 from dhiltgen/modelfile_use_mmap Fix use_mmap for modelfiles	2024-07-03 13:59:42 -07:00
Daniel Hiltgen	0e982bc1f4	Fix corner cases on tmp cleaner on mac When ollama is running a long time, tmp cleaners can remove the runners. This tightens up a few corner cases on arm macs where we failed with "server cpu not listed in available servers map[]"	2024-07-03 13:10:14 -07:00
Josh Yan	33a65e3ba3	error	2024-07-01 16:04:13 -07:00
Daniel Hiltgen	97c9e11768	Switch use_mmap to a pointer type This uses nil as undefined for a cleaner implementation.	2024-07-01 08:44:59 -07:00
Daniel Hiltgen	3518aaef33	Merge pull request #4218 from dhiltgen/auto_parallel Enable concurrency by default	2024-07-01 08:32:29 -07:00
Blake Mizerany	cb42e607c5	llm: speed up gguf decoding by a lot (#5246 ) Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow: * Too many allocations when decoding strings * Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O. The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3. This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.	2024-06-24 21:47:52 -07:00
Daniel Hiltgen	17b7186cd7	Enable concurrency by default This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these by the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small VRAM GPUs so this change also refines the algorithm so that when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.	2024-06-21 15:45:05 -07:00
Daniel Hiltgen	5bf5aeec01	Refine mmap default logic on linux If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.	2024-06-20 11:07:04 -07:00
Daniel Hiltgen	96624aa412	Merge pull request #5072 from dhiltgen/windows_path Move libraries out of users path	2024-06-19 09:13:39 -07:00
Daniel Hiltgen	7784ca33ce	Tighten up memory prediction logging Prior to this change, we logged the memory prediction multiple times as the scheduler iterates to find a suitable configuration, which can be confusing since only the last log before the server starts is actually valid. This now logs once just before starting the server on the final configuration. It also reports what library instead of always saying "offloading to gpu" when using CPU.	2024-06-18 09:15:35 -07:00
Daniel Hiltgen	171796791f	Adjust mmap logic for cuda windows for faster model load On Windows, recent llama.cpp changes make mmap slower in most cases, so default to off. This also implements a tri-state for use_mmap so we can detect the difference between a user provided value of true/false, or unspecified.	2024-06-17 16:54:30 -07:00
Daniel Hiltgen	b2799f111b	Move libraries out of users path We update the PATH on windows to get the CLI mapped, but this has an unintended side effect of causing other apps that may use our bundled DLLs to get terminated when we upgrade.	2024-06-17 13:12:18 -07:00
Daniel Hiltgen	da3bf23354	Workaround gfx900 SDMA bugs Implement support for GPU env var workarounds, and leverage this for the Vega RX 56 which needs HSA_ENABLE_SDMA=0 set to work properly	2024-06-14 15:38:13 -07:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	fc37c192ae	Refine CPU load behavior with system memory visibility	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	6fd04ca922	Improve multi-gpu handling at the limit Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block	2024-06-14 14:51:40 -07:00
Craig Hughes	b84aea1685	Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. (#3782 )	2024-06-09 10:57:09 -07:00
Michael Yang	e40145a39d	lint	2024-06-04 11:13:30 -07:00
Michael Yang	c895a7d13f	some gocritic	2024-06-04 11:13:30 -07:00
Michael Yang	829ff87bd1	revert tokenize ffi (#4761 ) * Revert "use `int32_t` for call to tokenize (#4738)" This reverts commit `763bb65dbb`. * Revert "vocab only" This reverts commit `bf54c845e9`. * Revert "use ffi for tokenizing/detokenizing" This reverts commit `26a00a0410`.	2024-05-31 18:54:21 -07:00
Jeffrey Morgan	a50a87a7b8	partial offloading: allow flash attention and disable mmap (#4734 ) * partial offloading: allow flash attention and disable mmap * allow mmap with num_gpu=0	2024-05-30 16:58:01 -07:00
Michael Yang	26a00a0410	use ffi for tokenizing/detokenizing	2024-05-29 11:26:47 -07:00
Daniel Hiltgen	92c81e8117	Give the final model loading more time On some systems, 1 minute isn't sufficient to finish the load after it hits 100% This creates 2 distinct timers, although they're both set to the same value for now so we can refine the timeouts further.	2024-05-28 09:08:10 -07:00
Lei Jitang	7487229c34	llm/server.go: Fix 2 minor typos (#4661 ) Signed-off-by: Lei Jitang <leijitang@outlook.com>	2024-05-27 17:21:10 -07:00
Daniel Hiltgen	0165ba1651	Merge pull request #4638 from dhiltgen/better_error Report better warning on client closed abort of load	2024-05-25 14:32:28 -07:00
Daniel Hiltgen	c4209d6d21	Report better warning on client closed abort of load If the client closes the connection before we finish loading the model we abort, so let's make the log message clearer why to help users understand this failure mode	2024-05-25 09:23:28 -07:00
Patrick Devine	4cc3be3035	Move envconfig and consolidate env vars (#4608 )	2024-05-24 14:57:15 -07:00
Daniel Hiltgen	b37b496a12	Wire up load progress This doesn't expose a UX yet, but wires the initial server portion of progress reporting during load	2024-05-23 13:36:48 -07:00
Jeffrey Morgan	38255d2af1	Use flash attention flag for now (#4580 ) * put flash attention behind flag for now * add test * remove print * up timeout for scheduler tests	2024-05-22 21:52:09 -07:00
Sam	e15307fdf4	feat: add support for flash_attn (#4120 ) * feat: enable flash attention if supported * feat: enable flash attention if supported * feat: enable flash attention if supported * feat: add flash_attn support	2024-05-20 13:36:03 -07:00
Patrick Devine	d1692fd3e0	fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461 )	2024-05-15 15:43:16 -07:00
Daniel Hiltgen	853ae490e1	Sanitize the env var debug log Only dump env vars we care about in the logs	2024-05-15 14:42:57 -07:00
Patrick Devine	6845988807	Ollama `ps` command for showing currently loaded models (#4327 )	2024-05-13 17:17:36 -07:00
jmorganca	92ca2cca95	Revert "only forward some env vars" This reverts commit `ce3b212d12`.	2024-05-10 22:53:21 -07:00
Daniel Hiltgen	c4014e73a2	Fall back to CPU runner with zero layers	2024-05-10 15:09:48 -07:00
Jeffrey Morgan	bb6fd02298	Don't clamp ctx size in `PredictServerFit` (#4317 ) * dont clamp ctx size in `PredictServerFit` * minimum 4 context * remove context warning	2024-05-10 10:17:12 -07:00
Michael Yang	cf442cd57e	fix typo	2024-05-09 16:23:37 -07:00
Michael Yang	ce3b212d12	only forward some env vars	2024-05-09 15:16:09 -07:00
Michael Yang	58876091f7	log clean up	2024-05-09 14:55:36 -07:00

89 commits