ollama

Author	SHA1	Message	Date
Michael Yang	1eb382da5a	add phi2 mem	2024-05-10 12:13:28 -07:00
Jeffrey Morgan	bb6fd02298	Don't clamp ctx size in `PredictServerFit` (#4317) * dont clamp ctx size in `PredictServerFit` * minimum 4 context * remove context warning	2024-05-10 10:17:12 -07:00
Michael Yang	cf442cd57e	fix typo	2024-05-09 16:23:37 -07:00
Michael Yang	ce3b212d12	only forward some env vars	2024-05-09 15:16:09 -07:00
Michael Yang	58876091f7	log clean up	2024-05-09 14:55:36 -07:00
Daniel Hiltgen	d0425f26cf	Merge pull request #4294 from dhiltgen/harden_subprocess_reaping Harden subprocess reaping	2024-05-09 14:02:16 -07:00
Bruce MacDonald	cfa84b8470	add done_reason to the api (#4235)	2024-05-09 13:30:14 -07:00
Daniel Hiltgen	84ac7ce139	Refine subprocess reaping	2024-05-09 11:21:31 -07:00
Daniel Hiltgen	920a4b0794	Merge remote-tracking branch 'upstream/main' into pr3702	2024-05-08 16:44:35 -07:00
Daniel Hiltgen	ee49844d09	Merge pull request #4153 from dhiltgen/gpu_verbose_response Add GPU usage	2024-05-08 16:39:11 -07:00
Daniel Hiltgen	8a516ac862	Merge pull request #4241 from dhiltgen/fix_tmp_override Detect noexec and report a better error	2024-05-08 15:34:22 -07:00
Daniel Hiltgen	bee2f4a3b0	Record GPU usage information This records more GPU usage information for eventual UX inclusion.	2024-05-08 14:45:39 -07:00
Michael Yang	eeb695261f	skip if same quantization	2024-05-07 17:44:19 -07:00
Daniel Hiltgen	72700279e2	Detect noexec and report a better error This will bubble up a much more informative error message if noexec is preventing us from running the subprocess	2024-05-07 16:46:15 -07:00
Michael Yang	1e0a669f75	Merge pull request #3682 from ollama/mxyng/quantize-all-the-things quantize any fp16/fp32 model	2024-05-07 15:20:49 -07:00
Michael Yang	4736391bfb	llm: add minimum based on layer size	2024-05-06 17:04:19 -07:00
Michael Yang	01811c176a	comments	2024-05-06 15:24:01 -07:00
Michael Yang	9685c34509	quantize any fp16/fp32 model - FROM /path/to/{safetensors,pytorch} - FROM /path/to/fp{16,32}.bin - FROM model:fp{16,32}	2024-05-06 15:24:01 -07:00
Daniel Hiltgen	380378cc80	Use our libraries first Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly	2024-05-06 14:23:29 -07:00
Jeffrey Morgan	ed740a2504	Fix `no slots available` error with concurrent requests (#4160)	2024-05-06 14:22:53 -07:00
Jeffrey Morgan	1b0e6c9c0e	Fix llava models not working after first request (#4164) * fix llava models not working after first request * individual requests only for llava models	2024-05-05 20:50:31 -07:00
Daniel Hiltgen	f56aa20014	Centralize server config handling This moves all the env var reading into one central module and logs the loaded config once at startup which should help in troubleshooting user server logs	2024-05-05 16:49:50 -07:00
Michael Yang	44869c59d6	omit prompt and generate settings from final response	2024-05-03 17:00:02 -07:00
Mark Ward	321d57e1a0	Removing go routine calling .wait from load.	2024-05-01 18:51:10 +00:00
Mark Ward	ba26c7aa00	it will always return an error due to Kill() discarding Wait() errors	2024-05-01 18:51:10 +00:00
Mark Ward	63c763685f	log when the waiting for the process to stop to help debug when other tasks execute during this wait. expire timer clear the timer reference because it will not be reused. close will clean up expireTimer if calling code has not already done this.	2024-05-01 18:51:10 +00:00
Mark Ward	948114e3e3	fix sched to wait for the runner to terminate to ensure following vram check will be more accurate	2024-05-01 18:51:10 +00:00
Jeffrey Morgan	f0c454ab57	gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068)	2024-05-01 11:46:03 -04:00
jmorganca	fcf4d60eee	llm: add back check for empty token cache	2024-04-30 17:38:44 -04:00
jmorganca	e33d5c2dbc	update llama.cpp commit to `952d03d`	2024-04-30 17:31:20 -04:00
Jeffrey Morgan	18d9a7e1f1	update llama.cpp submodule to `f364eb6` (#4060)	2024-04-30 17:25:39 -04:00
Daniel Hiltgen	23d23409a0	Update llama.cpp (#4036) * Bump llama.cpp to b2761 * Adjust types for bump	2024-04-29 23:18:48 -04:00
Jeffrey Morgan	7aa08a77ca	llm: dont cap context window limit to training context window (#3988)	2024-04-29 10:07:30 -04:00
Hernan Martinez	8a65717f55	Do not build AVX runners on ARM64	2024-04-26 23:55:32 -06:00
Hernan Martinez	b438d485f1	Use architecture specific folders in the generate script	2024-04-26 23:34:12 -06:00
Hernan Martinez	86e67fc4a9	Add import declaration for windows,arm64 to llm.go	2024-04-26 23:23:53 -06:00
Daniel Hiltgen	e4859c4563	Fine grain control over windows generate steps This will speed up CI which already tries to only build static for unit tests	2024-04-26 15:49:46 -07:00
Daniel Hiltgen	0b5c589ca2	Merge pull request #3966 from dhiltgen/bump Fix target in gen_windows.ps1	2024-04-26 15:36:53 -07:00
Michael Yang	65fadddc85	Merge pull request #3964 from ollama/mxyng/weights fix gemma, command-r layer weights	2024-04-26 15:23:33 -07:00
Daniel Hiltgen	ed5fb088c4	Fix target in gen_windows.ps1	2024-04-26 15:10:42 -07:00
Michael Yang	f81f308118	fix gemma, command-r layer weights	2024-04-26 15:00:55 -07:00
Jeffrey Morgan	bb31def011	return code `499` when user cancels request while a model is loading (#3955)	2024-04-26 17:38:29 -04:00
Daniel Hiltgen	5c0c2d1d09	Merge pull request #3954 from dhiltgen/ci_fixes Put back non-avx CPU build for windows	2024-04-26 13:09:03 -07:00
Daniel Hiltgen	421c878a2d	Put back non-avx CPU build for windows	2024-04-26 12:44:07 -07:00
Daniel Hiltgen	85801317d1	Fix clip log import	2024-04-26 09:43:46 -07:00
Daniel Hiltgen	2ed0d65948	Bump llama.cpp to b2737	2024-04-26 09:43:28 -07:00
Daniel Hiltgen	8671fdeda6	Refactor windows generate for more modular usage	2024-04-26 08:35:50 -07:00
Daniel Hiltgen	8feb97dc0d	Move cuda/rocm dependency gathering into generate script This will make it simpler for CI to accumulate artifacts from prior steps	2024-04-25 22:38:44 -07:00
Michael Yang	de4ded68b0	Merge pull request #3923 from ollama/mxyng/mem only count output tensors	2024-04-25 16:34:17 -07:00
Daniel Hiltgen	9b5a3c5991	Merge pull request #3914 from dhiltgen/mac_perf Improve mac parallel performance	2024-04-25 16:28:31 -07:00
Jeffrey Morgan	993cf8bf55	llm: limit generation to 10x context size to avoid run on generations (#3918) * llm: limit generation to 10x context size to avoid run on generations * add comment * simplify condition statement	2024-04-25 19:02:30 -04:00
Michael Yang	7bb7cb8a60	only count output tensors	2024-04-25 15:24:08 -07:00
jmorganca	ddf5c09a9b	use matrix multiplcation kernels in more cases	2024-04-25 13:58:54 -07:00
Roy Yang	5f73c08729	Remove trailing spaces (#3889)	2024-04-25 14:32:26 -04:00
Daniel Hiltgen	6e76348df7	Merge pull request #3834 from dhiltgen/not_found_in_path Report errors on server lookup instead of path lookup failure	2024-04-24 10:50:48 -07:00
Patrick Devine	14476d48cc	fixes for gguf (#3863)	2024-04-23 20:57:20 -07:00
Daniel Hiltgen	5445aaa94e	Add back memory escape valve If we get our predictions wrong, this can be used to set a lower memory limit as a workaround. Recent multi-gpu refactoring accidentally removed it, so this adds it back.	2024-04-23 17:09:02 -07:00
Daniel Hiltgen	058f6cd2cc	Move nested payloads to installer and zip file on windows Now that the llm runner is an executable and not just a dll, more users are facing problems with security policy configurations on windows that prevent users writing to directories and then executing binaries from the same location. This change removes payloads from the main executable on windows and shifts them over to be packaged in the installer and discovered based on the executables location. This also adds a new zip file for people who want to "roll their own" installation model.	2024-04-23 16:14:47 -07:00
Daniel Hiltgen	58888a74bc	Detect and recover if runner removed Tmp cleaners can nuke the file out from underneath us. This detects the missing runner, and re-initializes the payloads.	2024-04-23 10:05:26 -07:00
Daniel Hiltgen	cc5a71e0e3	Merge pull request #3709 from remy415/custom-gpu-defs Adds support for customizing GPU build flags in llama.cpp	2024-04-23 09:28:34 -07:00
Michael Yang	e83bcf7f9a	Merge pull request #3836 from ollama/mxyng/mixtral fix: mixtral graph	2024-04-23 09:15:10 -07:00
Daniel Hiltgen	34b9db5afc	Request and model concurrency This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.	2024-04-22 19:29:12 -07:00
Daniel Hiltgen	8711d03df7	Report errors on server lookup instead of path lookup failure	2024-04-22 19:08:47 -07:00
Michael Yang	435cc866a3	fix: mixtral graph	2024-04-22 17:19:44 -07:00
Daniel Hiltgen	aa72281eae	Trim spaces and quotes from llm lib override	2024-04-22 17:11:14 -07:00
Jeremy	9c0db4cc83	Update gen_windows.ps1 Fixed improper env references	2024-04-21 16:13:41 -04:00
Cheng	62be2050dd	chore: use errors.New to replace fmt.Errorf will much better (#3789)	2024-04-20 22:11:06 -04:00
Jeremy	6f18297b3a	Update gen_windows.ps1 Forgot a " on the write-host	2024-04-18 19:47:44 -04:00
Jeremy	15016413de	Update gen_windows.ps1 Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS to customize GPU builds on Windows	2024-04-18 19:27:16 -04:00
Jeremy	440b7190ed	Update gen_linux.sh Added OLLAMA_CUSTOM_CUDA_DEFS and OLLAMA_CUSTOM_ROCM_DEFS instead of OLLAMA_CUSTOM_GPU_DEFS	2024-04-18 19:18:10 -04:00
ManniX-ITA	c496967e56	Merge branch 'ollama:main' into mannix-server	2024-04-18 18:45:15 +02:00
Jeremy	3934c15895	Merge branch 'ollama:main' into custom-gpu-defs	2024-04-18 09:55:10 -04:00
Jeremy	fd048f1367	Merge branch 'ollama:main' into arm64static	2024-04-18 09:55:04 -04:00
Michael Yang	8645076a71	Merge pull request #3712 from ollama/mxyng/mem add stablelm graph calculation	2024-04-17 15:57:51 -07:00
Michael Yang	05e9424824	Merge pull request #3664 from ollama/mxyng/fix-padding-2 fix padding to only return padding	2024-04-17 15:57:40 -07:00
Michael Yang	3cf483fe48	add stablelm graph calculation	2024-04-17 13:57:19 -07:00
Jeremy	52f5370c48	add support for custom gpu build flags for llama.cpp	2024-04-17 16:00:48 -04:00
Jeremy	7c000ec3ed	adds support for OLLAMA_CUSTOM_GPU_DEFS to customize GPU build flags	2024-04-17 15:21:05 -04:00
Jeremy	ea4c284a48	Merge branch 'ollama:main' into arm64static	2024-04-17 15:11:38 -04:00
Jeremy	8aec92fa6d	rearranged conditional logic for static build, dockerfile updated	2024-04-17 14:43:28 -04:00
Michael Yang	a8b9b930b4	account for all non-repeating layers	2024-04-17 11:21:21 -07:00
Jeremy	70261b9bb6	move static build to its own flag	2024-04-17 13:04:28 -04:00
ManniX-ITA	c942e4a07b	Fixed startup sequence to report model loading	2024-04-17 17:40:32 +02:00
ManniX-ITA	bd54b08261	Streamlined WaitUntilRunning	2024-04-17 17:39:52 +02:00
Michael Yang	e74163af4c	fix padding to only return padding	2024-04-16 15:43:26 -07:00
Michael Yang	26df674785	scale graph based on gpu count	2024-04-16 14:44:13 -07:00
Jeffrey Morgan	7c9792a6e0	Support unicode characters in model path (#3681) * parse wide argv characters on windows * cleanup * move cleanup to end of `main`	2024-04-16 17:00:12 -04:00
Michael Yang	41a272de9f	darwin: no partial offloading if required memory greater than system	2024-04-16 11:22:38 -07:00
Jeffrey Morgan	f335722275	update llama.cpp submodule to `7593639` (#3665)	2024-04-15 23:04:43 -04:00
Michael Yang	6d53b67c2c	Merge pull request #3663 from ollama/mxyng/fix-padding	2024-04-15 17:44:54 -07:00
Michael Yang	969238b19e	fix padding in decode TODO: update padding() to _only_ returning the padding	2024-04-15 17:27:06 -07:00
Patrick Devine	9f8691c6c8	Add llama2 / torch models for `ollama create` (#3607)	2024-04-15 11:26:42 -07:00
Jeffrey Morgan	a0b8a32eb4	Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading (#3653) * terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading * use `unload` in signal handler	2024-04-15 12:09:32 -04:00
Jeffrey Morgan	309aef7fee	update llama.cpp submodule to `4bd0f93` (#3627)	2024-04-13 10:43:02 -07:00
Michael Yang	3397eff0cd	mixtral mem	2024-04-11 11:10:41 -07:00
Michael Yang	7e33a017c0	partial offloading	2024-04-10 11:37:20 -07:00
Michael Yang	8b2c10061c	refactor tensor query	2024-04-10 11:37:20 -07:00
Daniel Hiltgen	c5ff443b9f	Handle very slow model loads During testing, we're seeing some models take over 3 minutes.	2024-04-09 16:35:10 -07:00
Blake Mizerany	1524f323a3	Revert "build.go: introduce a friendlier way to build Ollama (#3548)" (#3564)	2024-04-09 15:57:45 -07:00
Blake Mizerany	fccf3eecaa	build.go: introduce a friendlier way to build Ollama (#3548) This commit introduces a more friendly way to build Ollama dependencies and the binary without abusing `go generate` and removing the unnecessary extra steps it brings with it. This script also provides nicer feedback to the user about what is happening during the build process. At the end, it prints a helpful message to the user about what to do next (e.g. run the new local Ollama).	2024-04-09 14:18:47 -07:00

504 commits