ollama

Author	SHA1	Message	Date
Michael Yang	9685c34509	quantize any fp16/fp32 model - FROM /path/to/{safetensors,pytorch} - FROM /path/to/fp{16,32}.bin - FROM model:fp{16,32}	2024-05-06 15:24:01 -07:00
Daniel Hiltgen	380378cc80	Use our libraries first Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly	2024-05-06 14:23:29 -07:00
Jeffrey Morgan	ed740a2504	Fix `no slots available` error with concurrent requests (#4160)	2024-05-06 14:22:53 -07:00
Jeffrey Morgan	1b0e6c9c0e	Fix llava models not working after first request (#4164) * fix llava models not working after first request * individual requests only for llava models	2024-05-05 20:50:31 -07:00
Daniel Hiltgen	f56aa20014	Centralize server config handling This moves all the env var reading into one central module and logs the loaded config once at startup which should help in troubleshooting user server logs	2024-05-05 16:49:50 -07:00
Michael Yang	44869c59d6	omit prompt and generate settings from final response	2024-05-03 17:00:02 -07:00
Mark Ward	321d57e1a0	Removing go routine calling .wait from load.	2024-05-01 18:51:10 +00:00
Mark Ward	ba26c7aa00	it will always return an error due to Kill() discarding Wait() errors	2024-05-01 18:51:10 +00:00
Mark Ward	63c763685f	log when the waiting for the process to stop to help debug when other tasks execute during this wait. expire timer clear the timer reference because it will not be reused. close will clean up expireTimer if calling code has not already done this.	2024-05-01 18:51:10 +00:00
Mark Ward	948114e3e3	fix sched to wait for the runner to terminate to ensure following vram check will be more accurate	2024-05-01 18:51:10 +00:00
Jeffrey Morgan	f0c454ab57	gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068)	2024-05-01 11:46:03 -04:00
jmorganca	fcf4d60eee	llm: add back check for empty token cache	2024-04-30 17:38:44 -04:00
jmorganca	e33d5c2dbc	update llama.cpp commit to `952d03d`	2024-04-30 17:31:20 -04:00
Jeffrey Morgan	18d9a7e1f1	update llama.cpp submodule to `f364eb6` (#4060)	2024-04-30 17:25:39 -04:00
Daniel Hiltgen	23d23409a0	Update llama.cpp (#4036) * Bump llama.cpp to b2761 * Adjust types for bump	2024-04-29 23:18:48 -04:00
Jeffrey Morgan	7aa08a77ca	llm: dont cap context window limit to training context window (#3988)	2024-04-29 10:07:30 -04:00
Hernan Martinez	8a65717f55	Do not build AVX runners on ARM64	2024-04-26 23:55:32 -06:00
Hernan Martinez	b438d485f1	Use architecture specific folders in the generate script	2024-04-26 23:34:12 -06:00
Hernan Martinez	86e67fc4a9	Add import declaration for windows,arm64 to llm.go	2024-04-26 23:23:53 -06:00
Daniel Hiltgen	e4859c4563	Fine grain control over windows generate steps This will speed up CI which already tries to only build static for unit tests	2024-04-26 15:49:46 -07:00
Daniel Hiltgen	0b5c589ca2	Merge pull request #3966 from dhiltgen/bump Fix target in gen_windows.ps1	2024-04-26 15:36:53 -07:00
Michael Yang	65fadddc85	Merge pull request #3964 from ollama/mxyng/weights fix gemma, command-r layer weights	2024-04-26 15:23:33 -07:00
Daniel Hiltgen	ed5fb088c4	Fix target in gen_windows.ps1	2024-04-26 15:10:42 -07:00
Michael Yang	f81f308118	fix gemma, command-r layer weights	2024-04-26 15:00:55 -07:00
Jeffrey Morgan	bb31def011	return code `499` when user cancels request while a model is loading (#3955)	2024-04-26 17:38:29 -04:00
Daniel Hiltgen	5c0c2d1d09	Merge pull request #3954 from dhiltgen/ci_fixes Put back non-avx CPU build for windows	2024-04-26 13:09:03 -07:00
Daniel Hiltgen	421c878a2d	Put back non-avx CPU build for windows	2024-04-26 12:44:07 -07:00
Daniel Hiltgen	85801317d1	Fix clip log import	2024-04-26 09:43:46 -07:00
Daniel Hiltgen	2ed0d65948	Bump llama.cpp to b2737	2024-04-26 09:43:28 -07:00
Daniel Hiltgen	8671fdeda6	Refactor windows generate for more modular usage	2024-04-26 08:35:50 -07:00
Daniel Hiltgen	8feb97dc0d	Move cuda/rocm dependency gathering into generate script This will make it simpler for CI to accumulate artifacts from prior steps	2024-04-25 22:38:44 -07:00
Michael Yang	de4ded68b0	Merge pull request #3923 from ollama/mxyng/mem only count output tensors	2024-04-25 16:34:17 -07:00
Daniel Hiltgen	9b5a3c5991	Merge pull request #3914 from dhiltgen/mac_perf Improve mac parallel performance	2024-04-25 16:28:31 -07:00
Jeffrey Morgan	993cf8bf55	llm: limit generation to 10x context size to avoid run on generations (#3918) * llm: limit generation to 10x context size to avoid run on generations * add comment * simplify condition statement	2024-04-25 19:02:30 -04:00
Michael Yang	7bb7cb8a60	only count output tensors	2024-04-25 15:24:08 -07:00
jmorganca	ddf5c09a9b	use matrix multiplication kernels in more cases	2024-04-25 13:58:54 -07:00
Roy Yang	5f73c08729	Remove trailing spaces (#3889)	2024-04-25 14:32:26 -04:00
Daniel Hiltgen	6e76348df7	Merge pull request #3834 from dhiltgen/not_found_in_path Report errors on server lookup instead of path lookup failure	2024-04-24 10:50:48 -07:00
Patrick Devine	14476d48cc	fixes for gguf (#3863)	2024-04-23 20:57:20 -07:00
Daniel Hiltgen	5445aaa94e	Add back memory escape valve If we get our predictions wrong, this can be used to set a lower memory limit as a workaround. Recent multi-gpu refactoring accidentally removed it, so this adds it back.	2024-04-23 17:09:02 -07:00
Daniel Hiltgen	058f6cd2cc	Move nested payloads to installer and zip file on windows Now that the llm runner is an executable and not just a dll, more users are facing problems with security policy configurations on windows that prevent users writing to directories and then executing binaries from the same location. This change removes payloads from the main executable on windows and shifts them over to be packaged in the installer and discovered based on the executables location. This also adds a new zip file for people who want to "roll their own" installation model.	2024-04-23 16:14:47 -07:00
Daniel Hiltgen	58888a74bc	Detect and recover if runner removed Tmp cleaners can nuke the file out from underneath us. This detects the missing runner, and re-initializes the payloads.	2024-04-23 10:05:26 -07:00
Daniel Hiltgen	cc5a71e0e3	Merge pull request #3709 from remy415/custom-gpu-defs Adds support for customizing GPU build flags in llama.cpp	2024-04-23 09:28:34 -07:00
Michael Yang	e83bcf7f9a	Merge pull request #3836 from ollama/mxyng/mixtral fix: mixtral graph	2024-04-23 09:15:10 -07:00
Daniel Hiltgen	34b9db5afc	Request and model concurrency This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.	2024-04-22 19:29:12 -07:00
Daniel Hiltgen	8711d03df7	Report errors on server lookup instead of path lookup failure	2024-04-22 19:08:47 -07:00
Michael Yang	435cc866a3	fix: mixtral graph	2024-04-22 17:19:44 -07:00
Daniel Hiltgen	aa72281eae	Trim spaces and quotes from llm lib override	2024-04-22 17:11:14 -07:00
Jeremy	9c0db4cc83	Update gen_windows.ps1 Fixed improper env references	2024-04-21 16:13:41 -04:00
Cheng	62be2050dd	chore: use errors.New to replace fmt.Errorf will much better (#3789)	2024-04-20 22:11:06 -04:00

434 commits