ollama

Author	SHA1	Message	Date
Daniel Hiltgen	5bf5aeec01	Refine mmap default logic on linux If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.	2024-06-20 11:07:04 -07:00
Daniel Hiltgen	96624aa412	Merge pull request #5072 from dhiltgen/windows_path Move libraries out of users path	2024-06-19 09:13:39 -07:00
Daniel Hiltgen	7784ca33ce	Tighten up memory prediction logging Prior to this change, we logged the memory prediction multiple times as the scheduler iterates to find a suitable configuration, which can be confusing since only the last log before the server starts is actually valid. This now logs once just before starting the server on the final configuration. It also reports what library instead of always saying "offloading to gpu" when using CPU.	2024-06-18 09:15:35 -07:00
Daniel Hiltgen	171796791f	Adjust mmap logic for cuda windows for faster model load On Windows, recent llama.cpp changes make mmap slower in most cases, so default to off. This also implements a tri-state for use_mmap so we can detect the difference between a user provided value of true/false, or unspecified.	2024-06-17 16:54:30 -07:00
Daniel Hiltgen	b2799f111b	Move libraries out of users path We update the PATH on windows to get the CLI mapped, but this has an unintended side effect of causing other apps that may use our bundled DLLs to get terminated when we upgrade.	2024-06-17 13:12:18 -07:00
Daniel Hiltgen	da3bf23354	Workaround gfx900 SDMA bugs Implement support for GPU env var workarounds, and leverage this for the Vega RX 56 which needs HSA_ENABLE_SDMA=0 set to work properly	2024-06-14 15:38:13 -07:00
Daniel Hiltgen	6f351bf586	review comments and coverage	2024-06-14 14:55:50 -07:00
Daniel Hiltgen	fc37c192ae	Refine CPU load behavior with system memory visibility	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	6fd04ca922	Improve multi-gpu handling at the limit Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block	2024-06-14 14:51:40 -07:00
Craig Hughes	b84aea1685	Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. (#3782)	2024-06-09 10:57:09 -07:00
Michael Yang	e40145a39d	lint	2024-06-04 11:13:30 -07:00
Michael Yang	c895a7d13f	some gocritic	2024-06-04 11:13:30 -07:00
Michael Yang	829ff87bd1	revert tokenize ffi (#4761) * Revert "use `int32_t` for call to tokenize (#4738)" This reverts commit `763bb65dbb`. * Revert "vocab only" This reverts commit `bf54c845e9`. * Revert "use ffi for tokenizing/detokenizing" This reverts commit `26a00a0410`.	2024-05-31 18:54:21 -07:00
Jeffrey Morgan	a50a87a7b8	partial offloading: allow flash attention and disable mmap (#4734) * partial offloading: allow flash attention and disable mmap * allow mmap with num_gpu=0	2024-05-30 16:58:01 -07:00
Michael Yang	26a00a0410	use ffi for tokenizing/detokenizing	2024-05-29 11:26:47 -07:00
Daniel Hiltgen	92c81e8117	Give the final model loading more time On some systems, 1 minute isn't sufficient to finish the load after it hits 100% This creates 2 distinct timers, although they're both set to the same value for now so we can refine the timeouts further.	2024-05-28 09:08:10 -07:00
Lei Jitang	7487229c34	llm/server.go: Fix 2 minor typos (#4661) Signed-off-by: Lei Jitang <leijitang@outlook.com>	2024-05-27 17:21:10 -07:00
Daniel Hiltgen	0165ba1651	Merge pull request #4638 from dhiltgen/better_error Report better warning on client closed abort of load	2024-05-25 14:32:28 -07:00
Daniel Hiltgen	c4209d6d21	Report better warning on client closed abort of load If the client closes the connection before we finish loading the model we abort, so lets make the log message clearer why to help users understand this failure mode	2024-05-25 09:23:28 -07:00
Patrick Devine	4cc3be3035	Move envconfig and consolidate env vars (#4608)	2024-05-24 14:57:15 -07:00
Daniel Hiltgen	b37b496a12	Wire up load progress This doesn't expose a UX yet, but wires the initial server portion of progress reporting during load	2024-05-23 13:36:48 -07:00
Jeffrey Morgan	38255d2af1	Use flash attention flag for now (#4580) * put flash attention behind flag for now * add test * remove print * up timeout for sheduler tests	2024-05-22 21:52:09 -07:00
Sam	e15307fdf4	feat: add support for flash_attn (#4120) * feat: enable flash attention if supported * feat: enable flash attention if supported * feat: enable flash attention if supported * feat: add flash_attn support	2024-05-20 13:36:03 -07:00
Patrick Devine	d1692fd3e0	fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461)	2024-05-15 15:43:16 -07:00
Daniel Hiltgen	853ae490e1	Sanitize the env var debug log Only dump env vars we care about in the logs	2024-05-15 14:42:57 -07:00
Patrick Devine	6845988807	Ollama `ps` command for showing currently loaded models (#4327)	2024-05-13 17:17:36 -07:00
jmorganca	92ca2cca95	Revert "only forward some env vars" This reverts commit `ce3b212d12`.	2024-05-10 22:53:21 -07:00
Daniel Hiltgen	c4014e73a2	Fall back to CPU runner with zero layers	2024-05-10 15:09:48 -07:00
Jeffrey Morgan	bb6fd02298	Don't clamp ctx size in `PredictServerFit` (#4317) * dont clamp ctx size in `PredictServerFit` * minimum 4 context * remove context warning	2024-05-10 10:17:12 -07:00
Michael Yang	cf442cd57e	fix typo	2024-05-09 16:23:37 -07:00
Michael Yang	ce3b212d12	only forward some env vars	2024-05-09 15:16:09 -07:00
Michael Yang	58876091f7	log clean up	2024-05-09 14:55:36 -07:00
Daniel Hiltgen	d0425f26cf	Merge pull request #4294 from dhiltgen/harden_subprocess_reaping Harden subprocess reaping	2024-05-09 14:02:16 -07:00
Bruce MacDonald	cfa84b8470	add done_reason to the api (#4235)	2024-05-09 13:30:14 -07:00
Daniel Hiltgen	84ac7ce139	Refine subprocess reaping	2024-05-09 11:21:31 -07:00
Daniel Hiltgen	920a4b0794	Merge remote-tracking branch 'upstream/main' into pr3702	2024-05-08 16:44:35 -07:00
Daniel Hiltgen	ee49844d09	Merge pull request #4153 from dhiltgen/gpu_verbose_response Add GPU usage	2024-05-08 16:39:11 -07:00
Daniel Hiltgen	bee2f4a3b0	Record GPU usage information This records more GPU usage information for eventual UX inclusion.	2024-05-08 14:45:39 -07:00
Daniel Hiltgen	72700279e2	Detect noexec and report a better error This will bubble up a much more informative error message if noexec is preventing us from running the subprocess	2024-05-07 16:46:15 -07:00
Daniel Hiltgen	380378cc80	Use our libraries first Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly	2024-05-06 14:23:29 -07:00
Jeffrey Morgan	ed740a2504	Fix `no slots available` error with concurrent requests (#4160)	2024-05-06 14:22:53 -07:00
Jeffrey Morgan	1b0e6c9c0e	Fix llava models not working after first request (#4164) * fix llava models not working after first request * individual requests only for llava models	2024-05-05 20:50:31 -07:00
Daniel Hiltgen	f56aa20014	Centralize server config handling This moves all the env var reading into one central module and logs the loaded config once at startup which should help in troubleshooting user server logs	2024-05-05 16:49:50 -07:00
Mark Ward	321d57e1a0	Removing go routine calling .wait from load.	2024-05-01 18:51:10 +00:00
Mark Ward	ba26c7aa00	it will always return an error due to Kill() discarding Wait() errors	2024-05-01 18:51:10 +00:00
Mark Ward	63c763685f	log when the waiting for the process to stop to help debug when other tasks execute during this wait. expire timer clear the timer reference because it will not be reused. close will clean up expireTimer if calling code has not already done this.	2024-05-01 18:51:10 +00:00
Mark Ward	948114e3e3	fix sched to wait for the runner to terminate to ensure following vram check will be more accurate	2024-05-01 18:51:10 +00:00
Jeffrey Morgan	7aa08a77ca	llm: dont cap context window limit to training context window (#3988)	2024-04-29 10:07:30 -04:00
Jeffrey Morgan	bb31def011	return code `499` when user cancels request while a model is loading (#3955)	2024-04-26 17:38:29 -04:00
Jeffrey Morgan	993cf8bf55	llm: limit generation to 10x context size to avoid run on generations (#3918) * llm: limit generation to 10x context size to avoid run on generations * add comment * simplify condition statement	2024-04-25 19:02:30 -04:00

71 commits