ollama

Author	SHA1	Message	Date
Jeffrey Morgan	4d311eb731	llm: architecture patch (#5316)	2024-06-26 21:38:12 -07:00
Blake Mizerany	cb42e607c5	llm: speed up gguf decoding by a lot (#5246) Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow: * Too many allocations when decoding strings * Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O. The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3. This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.	2024-06-24 21:47:52 -07:00
Blake Mizerany	2aa91a937b	cmd: defer stating model info until necessary (#5248) This commit changes the 'ollama run' command to defer fetching model information until it really needs it. That is, when in interactive mode. It also removes one such case where the model information is fetched in duplicate, just before calling generateInteractive and then again, first thing, in generateInteractive. This positively impacts the performance of the command: ; time ./before run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.168 total ; time ./before run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.220 total ; time ./before run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.217 total ; time ./after run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./after run llama3 'hi' 0.02s user 0.01s system 4% cpu 0.652 total ; time ./after run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./after run llama3 'hi' 0.01s user 0.01s system 5% cpu 0.498 total ; time ./after run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with or would you like to chat? ./after run llama3 'hi' 0.01s user 0.01s system 3% cpu 0.479 total ; time ./after run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./after run llama3 'hi' 0.02s user 0.01s system 5% cpu 0.507 total ; time ./after run llama3 'hi' Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat? ./after run llama3 'hi' 0.02s user 0.01s system 5% cpu 0.507 total	2024-06-24 20:14:03 -07:00
Daniel Hiltgen	ccef9431c8	Merge pull request #5205 from dhiltgen/modelfile_use_mmap Fix use_mmap parsing for modelfiles	2024-06-21 16:30:36 -07:00
Daniel Hiltgen	642cee1342	Sort the ps output Provide consistent ordering for the ps command - longest duration listed first	2024-06-21 15:59:41 -07:00
royjhan	9a9e7d83c4	Docs (#5149)	2024-06-21 15:52:09 -07:00
Daniel Hiltgen	9929751cc8	Disable concurrency for AMD + Windows Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in but will need to pay attention to model sizes, otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.	2024-06-21 15:45:05 -07:00
Daniel Hiltgen	17b7186cd7	Enable concurrency by default This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these by the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small VRAM GPUs so this change also refines the algorithm so that when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.	2024-06-21 15:45:05 -07:00
Michael Yang	189a43caa2	Merge pull request #5206 from ollama/mxyng/quantize fix: quantization with template	2024-06-21 13:44:34 -07:00
Michael Yang	e835ef1836	fix: quantization with template	2024-06-21 13:39:25 -07:00
Daniel Hiltgen	7e7749224c	Fix use_mmap parsing for modelfiles Add the new tristate parsing logic for the code path for modelfiles, as well as a unit test.	2024-06-21 12:27:19 -07:00
Daniel Hiltgen	c7c2f3bc22	Merge pull request #5194 from dhiltgen/linux_mmap_auto Refine mmap default logic on linux	2024-06-20 11:44:08 -07:00
Daniel Hiltgen	54a79d6a8a	Merge pull request #5125 from dhiltgen/fedora39 Bump latest fedora cuda repo to 39	2024-06-20 11:27:24 -07:00
Daniel Hiltgen	5bf5aeec01	Refine mmap default logic on linux If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.	2024-06-20 11:07:04 -07:00
Michael Yang	e01e535cbb	Merge pull request #5192 from ollama/mxyng/kv handle asymmetric embedding KVs	2024-06-20 10:46:24 -07:00
Josh	0195d6a2f8	Merge pull request #5188 from ollama/jyan/tmpdir2 fix: skip os.removeAll() if PID does not exist	2024-06-20 10:40:59 -07:00
Michael Yang	8e0641a9bf	handle asymmetric embedding KVs	2024-06-20 09:57:27 -07:00
Josh Yan	662568d453	err!=nil check	2024-06-20 09:30:59 -07:00
Josh Yan	4ebb66c662	reformat error check	2024-06-20 09:23:43 -07:00
Josh Yan	23e899f32d	skip os.removeAll() if PID does not exist	2024-06-20 08:51:35 -07:00
royjhan	fedf71635e	Extend api/show and ollama show to return more model info (#4881) * API Show Extended * Initial Draft of Information Co-Authored-By: Patrick Devine <pdevine@sonic.net> * Clean Up * Descriptive arg error messages and other fixes * Second Draft of Show with Projectors Included * Remove Chat Template * Touches * Prevent wrapping from files * Verbose functionality * Docs * Address Feedback * Lint * Resolve Conflicts * Function Name * Tests for api/show model info * Show Test File * Add Projector Test * Clean routes * Projector Check * Move Show Test * Touches * Doc update --------- Co-authored-by: Patrick Devine <pdevine@sonic.net>	2024-06-19 14:19:02 -07:00
Daniel Hiltgen	97c59be653	Merge pull request #5074 from dhiltgen/app_log_rotation Implement log rotation for tray app	2024-06-19 13:02:24 -07:00
Daniel Hiltgen	9d8a4988e8	Implement log rotation for tray app	2024-06-19 12:53:34 -07:00
Michael Yang	1ae0750a21	Merge pull request #5147 from ollama/mxyng/cleanup remove confusing log message	2024-06-19 12:50:31 -07:00
Michael Yang	9d91e5e587	remove confusing log message	2024-06-19 11:14:11 -07:00
Daniel Hiltgen	96624aa412	Merge pull request #5072 from dhiltgen/windows_path Move libraries out of users path	2024-06-19 09:13:39 -07:00
Daniel Hiltgen	10f33b8537	Merge pull request #5146 from dhiltgen/backout Put back temporary intel GPU env var	2024-06-19 09:12:45 -07:00
Daniel Hiltgen	4a633cc295	Merge pull request #5145 from dhiltgen/bad_loads Fix bad symbol load detection	2024-06-19 09:12:33 -07:00
Daniel Hiltgen	d34d88e417	Revert "Revert "gpu: add env var for detecting Intel oneapi gpus (#5076)"" This reverts commit `755b4e4fc2`.	2024-06-19 08:57:41 -07:00
Daniel Hiltgen	52ce350b7a	Fix bad symbol load detection Pointer dereferences weren't correct on a few libraries, which explains some crashes on older systems or miswired symlinks for discovery libraries.	2024-06-19 08:39:07 -07:00
Daniel Hiltgen	2abebb2cbe	Merge pull request #5128 from zhewang1-intc/fix_levelzero_empty_symbol_detect Fix levelzero empty symbol detect	2024-06-19 08:33:16 -07:00
Blake Mizerany	380e06e5be	types/model: remove Digest The Digest type in its current form is awkward to work with and presents challenges with regard to how it serializes via String using the '-' prefix. We currently only use this in ollama.com, so we'll move our specific needs around digest parsing and validation there.	2024-06-18 20:28:11 -07:00
Wang,Zhe	badf975e45	get real func ptr.	2024-06-19 09:00:51 +08:00
Wang,Zhe	755b4e4fc2	Revert "gpu: add env var for detecting Intel oneapi gpus (#5076)" This reverts commit `163cd3e77c`.	2024-06-19 08:59:58 +08:00
Daniel Hiltgen	1a1c99e334	Bump latest fedora cuda repo to 39	2024-06-18 17:13:54 -07:00
Michael Yang	21adf8b6d2	Merge pull request #5121 from ollama/mxyng/deepseekv2 deepseek v2 graph	2024-06-18 16:30:58 -07:00
Daniel Hiltgen	784bf88b0d	Wire up windows AMD driver reporting This seems to be ROCm version, not actually driver version, but it may be useful for toggling logic for VRAM reporting in the future	2024-06-18 16:22:47 -07:00
Michael Yang	e873841cbb	deepseek v2 graph	2024-06-18 15:35:12 -07:00
Daniel Hiltgen	26d0bf9236	Merge pull request #5117 from dhiltgen/fix_prediction Handle models with divergent layer sizes	2024-06-18 11:36:51 -07:00
Daniel Hiltgen	359b15a597	Handle models with divergent layer sizes The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.	2024-06-18 11:05:34 -07:00
Daniel Hiltgen	b55958a587	Merge pull request #5106 from dhiltgen/clean_logs Tighten up memory prediction logging	2024-06-18 09:24:38 -07:00
Daniel Hiltgen	7784ca33ce	Tighten up memory prediction logging Prior to this change, we logged the memory prediction multiple times as the scheduler iterates to find a suitable configuration, which can be confusing since only the last log before the server starts is actually valid. This now logs once just before starting the server on the final configuration. It also reports what library instead of always saying "offloading to gpu" when using CPU.	2024-06-18 09:15:35 -07:00
Daniel Hiltgen	c9c8c98bf6	Merge pull request #5105 from dhiltgen/cuda_mmap Adjust mmap logic for cuda windows for faster model load	2024-06-17 17:07:30 -07:00
Daniel Hiltgen	171796791f	Adjust mmap logic for cuda windows for faster model load On Windows, recent llama.cpp changes make mmap slower in most cases, so default to off. This also implements a tri-state for use_mmap so we can detect the difference between a user provided value of true/false, or unspecified.	2024-06-17 16:54:30 -07:00
Jeffrey Morgan	176d0f7075	Update import.md	2024-06-17 19:44:14 -04:00
Daniel Hiltgen	8ed51cac37	Merge pull request #5103 from dhiltgen/faster_win_build Revert powershell jobs, but keep nvcc and cmake parallelism	2024-06-17 14:23:18 -07:00
Daniel Hiltgen	c9e6f0542d	Merge pull request #5069 from dhiltgen/ci_release Implement custom github release action	2024-06-17 13:59:37 -07:00
Daniel Hiltgen	b0930626c5	Add back lower level parallel flags nvcc supports parallelism (threads) and cmake + make can use -j, while msbuild requires /p:CL_MPcount=8	2024-06-17 13:44:46 -07:00
Daniel Hiltgen	e890be4814	Revert "More parallelism on windows generate" This reverts commit `0577af98f4`.	2024-06-17 13:32:46 -07:00
Daniel Hiltgen	b2799f111b	Move libraries out of users path We update the PATH on windows to get the CLI mapped, but this has an unintended side effect of causing other apps that may use our bundled DLLs to get terminated when we upgrade.	2024-06-17 13:12:18 -07:00

3244 commits