ollama/llm
Daniel Hiltgen 34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models, by spawning multiple runners. The defaults are
currently 1 concurrent request per model and 1 loaded model at a time,
but these can be adjusted via the OLLAMA_NUM_PARALLEL and
OLLAMA_MAX_LOADED_MODELS environment variables.
2024-04-22 19:29:12 -07:00
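The commit message above names two environment variables with defaults of 1 each. Below is a minimal Go sketch of how such settings could be read; the envInt helper and its parsing behavior are illustrative assumptions, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt returns the integer value of the named environment variable,
// falling back to def when it is unset or not a positive integer.
// (Hypothetical helper; the real ollama code may parse these differently.)
func envInt(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	// Defaults of 1 each, as described in the commit message.
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)           // concurrent requests per model
	maxLoadedModels := envInt("OLLAMA_MAX_LOADED_MODELS", 1)  // models kept loaded at once
	fmt.Println("parallel:", numParallel, "max loaded:", maxLoadedModels)
}
```

For example, exporting OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2 before starting the server would raise both limits under this scheme.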
ext_server Support unicode characters in model path (#3681) 2024-04-16 17:00:12 -04:00
generate rearranged conditional logic for static build, dockerfile updated 2024-04-17 14:43:28 -04:00
llama.cpp@7593639ce3 update llama.cpp submodule to 7593639 (#3665) 2024-04-15 23:04:43 -04:00
patches Bump to b2581 2024-04-02 11:53:07 -07:00
ggla.go refactor tensor query 2024-04-10 11:37:20 -07:00
ggml.go add stablelm graph calculation 2024-04-17 13:57:19 -07:00
gguf.go fix padding to only return padding 2024-04-16 15:43:26 -07:00
llm.go cgo quantize 2024-04-08 15:31:08 -07:00
llm_darwin_amd64.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
llm_darwin_arm64.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
llm_linux.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
llm_windows.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
memory.go Request and model concurrency 2024-04-22 19:29:12 -07:00
payload.go Request and model concurrency 2024-04-22 19:29:12 -07:00
server.go Request and model concurrency 2024-04-22 19:29:12 -07:00
status.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00