ollama

Author	SHA1	Message	Date
Daniel Hiltgen	d6e3b64582	Fix concurrency for CPU mode Prior refactoring passes accidentally removed the logic to bypass VRAM checks for CPU loads. This adds that back, along with test coverage. This also fixes loaded map access in the unit test to be behind the mutex which was likely the cause of various flakes in the tests.	2024-04-28 13:42:39 -07:00
Jeffrey Morgan	bb31def011	return code `499` when user cancels request while a model is loading (#3955)	2024-04-26 17:38:29 -04:00
Blake Mizerany	37f9c8ad99	types/model: overhaul Name and Digest types (#3924)	2024-04-26 13:08:32 -07:00
Daniel Hiltgen	9b5a3c5991	Merge pull request #3914 from dhiltgen/mac_perf Improve mac parallel performance	2024-04-25 16:28:31 -07:00
Jeffrey Morgan	00b0699c75	Reload model if `num_gpu` changes (#3920) * reload model if `num_gpu` changes * dont reload on -1 * fix tests	2024-04-25 19:02:40 -04:00
Daniel Hiltgen	b123be5b71	Adjust context size for parallelism	2024-04-25 13:58:54 -07:00
Daniel Hiltgen	f503a848c2	Merge pull request #3895 from brycereitano/shiftloading Move ggml loading to when attempting to fit	2024-04-25 09:24:08 -07:00
Bryce Reitano	36a6daccab	Restructure loading conditional chain	2024-04-24 17:37:03 -06:00
Bryce Reitano	ceb0e26e5e	Provide variable ggml for TestLoad	2024-04-24 17:19:55 -06:00
Bryce Reitano	284e02bed0	Move ggml loading to when we attempt fitting	2024-04-24 17:17:24 -06:00
Michael Yang	592dae31c8	update copy to use model.Name	2024-04-24 15:54:54 -07:00
Daniel Hiltgen	d8851cb7a0	Harden sched TestLoad Give the go routine a moment to deliver the expired event	2024-04-23 16:14:47 -07:00
Daniel Hiltgen	34b9db5afc	Request and model concurrency This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.	2024-04-22 19:29:12 -07:00
Cheng	62be2050dd	chore: use errors.New to replace fmt.Errorf will much better (#3789)	2024-04-20 22:11:06 -04:00
Patrick Devine	9f8691c6c8	Add llama2 / torch models for `ollama create` (#3607)	2024-04-15 11:26:42 -07:00
Jeffrey Morgan	a0b8a32eb4	Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading (#3653) * terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading * use `unload` in signal handler	2024-04-15 12:09:32 -04:00
Blake Mizerany	a7b431e743	server: provide helpful workaround hint when stalling on pull (#3584) This is a quick fix to help users who are stuck on the "pull" step at 99%. In the near future we're introducing a new registry client that should/will hopefully be smarter. In the meantime, this should unblock the users hitting issue #1736.	2024-04-10 16:24:37 -07:00
Michael Yang	9502e5661f	cgo quantize	2024-04-08 15:31:08 -07:00
Michael Yang	e1c9a2a00f	no blob create if already exists	2024-04-08 15:09:48 -07:00
Daniel Hiltgen	6589eb8a8c	Revert options as a ref in the server	2024-04-02 16:44:10 -07:00
Daniel Hiltgen	58d95cc9bd	Switch back to subprocessing for llama.cpp This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shutdown when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.	2024-04-01 16:48:18 -07:00
Patrick Devine	3b6a9154dd	Simplify model conversion (#3422)	2024-04-01 16:14:53 -07:00
Michael Yang	91b3e4d282	update memory calculations count each layer independently when deciding gpu offloading	2024-04-01 13:16:32 -07:00
Michael Yang	d338d70492	refactor model parsing	2024-04-01 13:16:15 -07:00
Patrick Devine	5a5efee46b	Add gemma safetensors conversion (#3250) Co-authored-by: Michael Yang <mxyng@pm.me>	2024-03-28 18:54:01 -07:00
Michael Yang	af8a8a6b59	fix: trim quotes on OLLAMA_ORIGINS	2024-03-27 15:24:29 -07:00
Patrick Devine	1b272d5bcd	change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347)	2024-03-26 13:04:17 -07:00
Daniel Hiltgen	949b6c01e0	Revamp go based integration tests This uplevels the integration tests to run the server which can allow testing an existing server, or a remote server.	2024-03-23 14:24:18 +01:00
Blake Mizerany	703684a82a	server: replace blob prefix separator from ':' to '-' (#3146) This fixes issues with blob file names that contain ':' characters to be rejected by file systems that do not support them.	2024-03-14 20:18:06 -07:00
Patrick Devine	47cfe58af5	Default Keep Alive environment variable (#3094) --------- Co-authored-by: Chris-AS1 <8493773+Chris-AS1@users.noreply.github.com>	2024-03-13 13:29:40 -07:00
Daniel Hiltgen	4a5c9b8035	Finish unwinding idempotent payload logic The recent ROCm change partially removed idempotent payloads, but the ggml-metal.metal file for mac was still idempotent. This finishes switching to always extract the payloads, and now that idempotency is gone, the version directory is no longer useful.	2024-03-09 08:34:39 -08:00
Jeffrey Morgan	5b3fad9636	separate out `isLocalIP`	2024-03-09 00:22:08 -08:00
Jeffrey Morgan	bfec2c6e10	simplify host checks	2024-03-08 23:29:53 -08:00
Jeffrey Morgan	5c143af726	add additional allowed hosts	2024-03-08 23:23:59 -08:00
Jeffrey Morgan	fc8c044584	add allowed host middleware and remove `workDir` middleware (#3018)	2024-03-08 22:23:47 -08:00
Michael Yang	76bdebbadf	decode ggla	2024-03-08 15:46:25 -08:00
Bruce MacDonald	0cebc79cba	fix: allow importing a model from name reference (#3005)	2024-03-08 12:27:47 -05:00
Jeffrey Morgan	fc06205971	Revert "adjust download and upload concurrency based on available bandwidth" (#2995)	2024-03-07 18:10:16 -08:00
Daniel Hiltgen	6c5ccb11f9	Revamp ROCm support This refines where we extract the LLM libraries to by adding a new OLLAMA_HOME env var, that defaults to `~/.ollama` The logic was already idempotent, so this should speed up startups after the first time a new release is deployed. It also cleans up after itself. We now build only a single ROCm version (latest major) on both windows and linux. Given the large size of ROCms tensor files, we split the dependency out. It's bundled into the installer on windows, and a separate download on windows. The linux install script is now smart and detects the presence of AMD GPUs and looks to see if rocm v6 is already present, and if not, then downloads our dependency tar file. For Linux discovery, we now use sysfs and check each GPU against what ROCm supports so we can degrade to CPU gracefully instead of having llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows dynamic library loading logic to access the amdhip64.dll APIs to query the GPU information.	2024-03-07 10:36:50 -08:00
Michael Yang	2e20110e50	Merge pull request #2221 from ollama/mxyng/up-down-ccy adjust download and upload concurrency based on available bandwidth	2024-03-07 09:27:33 -08:00
Patrick Devine	2c017ca441	Convert Safetensors to an Ollama model (#2824)	2024-03-06 21:01:51 -08:00
Jeffrey Morgan	3b4bab3dc5	Fix embeddings load model behavior (#2848)	2024-02-29 17:40:56 -08:00
Michael Yang	0e19476b56	prepend image tags (#2789) instead of appending image tags, prepend them - this generally produces better results	2024-02-29 11:30:14 -08:00
Michael Yang	084d846621	refactor	2024-02-21 13:42:48 -08:00
Michael Yang	6a4b994433	lint	2024-02-21 13:42:48 -08:00
Michael Yang	bea007deb7	use LimitGroup for uploads	2024-02-21 13:42:48 -08:00
Michael Yang	074934be03	adjust group limit based on download speed	2024-02-21 13:42:48 -08:00
Michael Yang	0de12368a0	add new LimitGroup for dynamic concurrency	2024-02-21 13:42:48 -08:00
Michael Yang	917bd61084	refactor download run	2024-02-21 13:42:46 -08:00
Jeffrey Morgan	287ba11500	better error message when calling `/api/generate` or `/api/chat` with embedding models	2024-02-20 21:53:45 -05:00

487 commits