ollama

Author	SHA1	Message	Date
Jeffrey Morgan	f83881390f	revert submodule back to `328b83de23b33240e28f4e74900d1d06726f5eb1`	2024-01-10 18:42:39 -05:00
Jeffrey Morgan	224fbf2795	update submodule to commit `1fc2f265ff9377a37fd2c61eae9cd813a3491bea` until its main branch is fixed	2024-01-10 17:03:15 -05:00
Jeffrey Morgan	2c6e8f5248	Update submodule to `6efb8eb30e7025b168f3fda3ff83b9b386428ad6` (#1885 ) * update submodule to `6efb8eb30e7025b168f3fda3ff83b9b386428ad6` * unblock condition variable in `update_slots` when closing server	2024-01-10 16:48:38 -05:00
Jeffrey Morgan	34344d801c	clean up cmake `build` directory when cross compiling macOS builds	2024-01-09 17:13:56 -05:00
Jeffrey Morgan	8a8c7e7f8d	only build for metal on `arm64`	2024-01-09 13:51:08 -05:00
Michael Yang	f921e2696e	typo	2024-01-09 09:45:42 -08:00
Michael Yang	4a33cede20	remove unused fields and functions	2024-01-09 09:37:40 -08:00
Michael Yang	2bb2bdd5d4	fix lint	2024-01-09 09:36:58 -08:00
Jeffrey Morgan	f387e9631b	use runner if cuda alloc won't fit	2024-01-09 00:44:34 -05:00
Jeffrey Morgan	cb534e6ac2	use 10% vram overhead for cuda	2024-01-08 23:17:44 -05:00
Jeffrey Morgan	58ce2d8273	better estimate scratch buffer size	2024-01-08 21:32:44 -05:00
Jeffrey Morgan	18ddf6d57d	fix windows build	2024-01-08 20:04:01 -05:00
Jeffrey Morgan	08f1e18965	Offload layers to GPU based on new model size estimates (#1850 ) * select layers based on estimated model memory usage * always account for scratch vram * dont load +1 layers * better estimation for graph alloc * Update gpu/gpu_darwin.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * Update llm/llm.go * add overhead for cuda memory * Update llm/llm.go Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com> * fix build error on linux * address comments --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2024-01-08 16:42:00 -05:00
Jeffrey Morgan	5feec959ad	dont use `-Wall` in static build (#1833 )	2024-01-07 10:39:19 -05:00
Jeffrey Morgan	dbdd50b283	add `-DCMAKE_SYSTEM_NAME=Darwin` cmake flag (#1832 )	2024-01-07 00:46:17 -05:00
Bruce MacDonald	3367b5f3df	remove unused generate patches (#1810 )	2024-01-05 11:25:45 -05:00
Daniel Hiltgen	9983fa5f4e	Cleanup stale submodule If the tree has a stale submodule, make sure we clean it up first	2024-01-04 13:40:16 -08:00
Daniel Hiltgen	fac9060da5	Init submodule with new path	2024-01-04 13:00:13 -08:00
Daniel Hiltgen	77d96da94b	Code shuffle to clean up the llm dir	2024-01-04 12:12:05 -08:00
Daniel Hiltgen	e9ce91e9a6	Load dynamic cpu lib on windows On linux, we link the CPU library in to the Go app and fall back to it when no GPU match is found. On windows we do not link in the CPU library so that we can better control our dependencies for the CLI. This fixes the logic so we correctly fallback to the dynamic CPU library on windows.	2024-01-04 08:41:41 -08:00
Jeffrey Morgan	c0285158a9	tweak memory requirements error text	2024-01-03 19:47:18 -05:00
Jeffrey Morgan	77a66df72c	add macOS memory check for 47B models	2024-01-03 19:46:16 -05:00
Jeffrey Morgan	5b4837f881	remove unused filetype check	2024-01-03 19:45:39 -05:00
Jeffrey Morgan	29340c2e62	update cmake flags for `amd64` macOS (#1780 ) * update cmake flags for intel macOS * remove `LLAMA_K_QUANTS` * put back `CMAKE_OSX_DEPLOYMENT_TARGET` and disable `LLAMA_F16C`	2024-01-03 19:22:15 -05:00
Daniel Hiltgen	d5ec730354	Merge pull request #1779 from dhiltgen/refined_amd_gpu_list Improve maintainability of Radeon card list	2024-01-03 16:18:57 -08:00
Daniel Hiltgen	ddbfa6fe31	Fix CPU only builds Go embed doesn't like when there's no matching files, so put a dummy placeholder in to allow building without any GPU support If no "server" library is found, it's safely ignored at runtime.	2024-01-03 16:08:34 -08:00
Daniel Hiltgen	16f4603b67	Improve maintainability of Radeon card list This moves the list of AMD GPUs to an easier to maintain list which should make it easier to update over time.	2024-01-03 15:16:56 -08:00
Bruce MacDonald	0b3118e0af	fix: relay request opts to loaded llm prediction (#1761 )	2024-01-03 12:01:42 -05:00
Daniel Hiltgen	0498f7ce56	Get rid of one-line llama.log This one log line was triggering a single line llama.log to be generated in the pwd of the server	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	738a8d12eb	Rename the ollama cmakefile	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	d966b730ac	Switch windows build to fully dynamic Refactor where we store build outputs, and support a fully dynamic loading model on windows so the base executable has no special dependencies thus doesn't require a special PATH.	2024-01-02 15:36:16 -08:00
Daniel Hiltgen	9a70aecccb	Refactor how we augment llama.cpp This changes the model for llama.cpp inclusion so we're not applying a patch, but instead have the C++ code directly in the ollama tree, which should make it easier to refine and update over time.	2024-01-02 15:35:55 -08:00
Jeffrey Morgan	d4ebdadbe7	enable `cache_prompt` by default	2023-12-27 14:23:42 -05:00
K0IN	10da41d677	Add Cache flag to api (#1642 )	2023-12-22 17:16:20 -05:00
Daniel Hiltgen	e5202eb687	Quiet down llama.cpp logging by default By default builds will now produce non-debug and non-verbose binaries. To enable verbose logs in llama.cpp and debug symbols in the native code, set `CGO_CFLAGS=-g`	2023-12-22 08:47:18 -08:00
Daniel Hiltgen	fa24e73b82	Remove CPU build, fixup linux build script	2023-12-21 18:21:31 -08:00
Daniel Hiltgen	325d74985b	Fix CPU performance on hyperthreaded systems The default thread count logic was broken and resulted in 2x the number of threads as it should on a hyperthreading CPU resulting in thrashing and poor performance.	2023-12-21 16:23:36 -08:00
Daniel Hiltgen	d9cd3d9667	Revive windows build The windows native setup still needs some more work, but this gets it building again and if you set the PATH properly, you can run the resulting exe on a cuda system.	2023-12-20 17:21:54 -08:00
Daniel Hiltgen	7555ea44f8	Revamp the dynamic library shim This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6 given 5.7 builds don't work on the latest ROCm library that just shipped.	2023-12-20 14:45:57 -08:00
Daniel Hiltgen	6558f94ed0	Fix darwin intel build	2023-12-19 13:32:24 -08:00
Daniel Hiltgen	54dbfa4c4a	Carry ggml-metal.metal as payload	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	3269535a4c	Refine handling of shim presence This allows the CPU only builds to work on systems with Radeon cards	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	1b991d0ba9	Refine build to support CPU only If someone checks out the ollama repo and doesn't install the CUDA library, this will ensure they can build a CPU only version	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	9adca7f711	Bump llama.cpp to b1662 and set n_parallel=1	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	89bbaafa64	Build linux using ubuntu 20.04 This changes the container-based linux build to use an older Ubuntu distro to improve our compatibility matrix for older user machines	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	35934b2e05	Adapted rocm support to cgo based llama.cpp	2023-12-19 09:05:46 -08:00
65a	f8ef4439e9	Use build tags to generate accelerated binaries for CUDA and ROCm on Linux. The build tags rocm or cuda must be specified to both go generate and go build. ROCm builds should have both ROCM_PATH set (and the ROCM SDK present) as well as CLBlast installed (for GGML) and CLBlast_DIR set in the environment to the CLBlast cmake directory (likely /usr/lib/cmake/CLBlast). Build tags are also used to switch VRAM detection between cuda and rocm implementations, using added "accelerator_foo.go" files which contain architecture specific functions and variables. accelerator_none is used when no tags are set, and a helper function addRunner will ignore it if it is the chosen accelerator. Fix go generate commands, thanks @deadmeu for testing.	2023-12-19 09:05:46 -08:00
Daniel Hiltgen	d4cd695759	Add cgo implementation for llama.cpp Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.	2023-12-19 09:05:46 -08:00
Bruce MacDonald	811b1f03c8	deprecate ggml - remove ggml runner - automatically pull gguf models when ggml detected - tell users to update to gguf in the case automatic pull fails Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>	2023-12-19 09:05:46 -08:00
Jeffrey Morgan	6b5bdfa6c9	update runner submodule	2023-12-18 17:33:46 -05:00
Jeffrey Morgan	c063ee4af0	update runner submodule to fix hipblas build	2023-12-18 15:41:13 -05:00
Jeffrey Morgan	b85982eb91	update runner submodule	2023-12-18 12:43:31 -05:00
Bruce MacDonald	6ee8c80199	restore model load duration on generate response (#1524 ) * restore model load duration on generate response - set model load duration on generate and chat done response - calculate createAt time when response created * remove checkpoints predict opts * Update routes.go	2023-12-14 12:15:50 -05:00
Jeffrey Morgan	31f0551dab	Update runner to support mixtral and mixture of experts (MoE) (#1475 )	2023-12-13 17:15:10 -05:00
Michael Yang	4251b342de	Merge pull request #1469 from jmorganca/mxyng/model-types remove per-model types	2023-12-12 12:27:03 -08:00
Bruce MacDonald	3144e2a439	exponential back-off (#1484 )	2023-12-12 12:33:02 -05:00
Bruce MacDonald	c0960e29b5	retry on concurrent request failure (#1483 ) - remove parallel	2023-12-12 12:14:35 -05:00
Patrick Devine	910e9401d0	Multimodal support (#1216 ) --------- Co-authored-by: Matt Apperson <mattapperson@Matts-MacBook-Pro.local>	2023-12-11 13:56:22 -08:00
Michael Yang	56ffc3023a	remove per-model types mostly replaced by decoding tensors except ggml models which only support llama	2023-12-11 09:40:21 -08:00
Jeffrey Morgan	fa2f095bd9	fix model name returned by `/api/generate` being different than the model name provided	2023-12-10 11:42:15 -05:00
Jeffrey Morgan	d9a250e9b5	seek to end of file when decoding older model formats	2023-12-09 21:14:35 -05:00
Jeffrey Morgan	944519ed16	seek to eof for older model binaries	2023-12-09 20:48:57 -05:00
Jeffrey Morgan	2dd040d04c	do not use `--parallel 2` for old runners	2023-12-09 20:17:33 -05:00
Bruce MacDonald	bbe41ce41a	fix: parallel queueing race condition caused silent failure (#1445 ) * fix: queued request failures - increase parallel requests to 2 to complete queued request, queueing is managed in ollama * log stream errors	2023-12-09 14:14:02 -05:00
Michael Yang	f1b049fed8	Merge pull request #1377 from jmorganca/mxyng/qwen update for qwen	2023-12-06 12:31:51 -08:00
Michael Yang	b9495ea162	load projectors	2023-12-05 14:36:12 -08:00
Michael Yang	409bb9674e	Merge pull request #1308 from jmorganca/mxyng/split-from split from into one or more models	2023-12-05 14:33:03 -08:00
Michael Yang	d3479c07a1	Merge pull request #1250 from jmorganca/mxyng/create-layer refactor layer creation	2023-12-05 14:32:52 -08:00
Bruce MacDonald	195e3d9dbd	chat api endpoint (#1392 )	2023-12-05 14:57:33 -05:00
Jeffrey Morgan	00d06619a1	Revert "chat api (#991 )" while context variable is fixed This reverts commit `7a0899d62d`.	2023-12-04 21:16:27 -08:00
Michael Yang	5a5dca13b2	comments	2023-12-04 16:59:23 -08:00
Michael Yang	72e7a49aa9	seek instead of copyn	2023-12-04 16:59:23 -08:00
Michael Yang	2cb0fa7d40	split from into one or more models	2023-12-04 16:59:23 -08:00
Michael Yang	b2816bca67	unnecessary ReadSeeker for DecodeGGML	2023-12-04 16:59:23 -08:00
Bruce MacDonald	7a0899d62d	chat api (#991 ) - update chat docs - add messages chat endpoint - remove deprecated context and template generate parameters from docs - context and template are still supported for the time being and will continue to work as expected - add partial response to chat history	2023-12-04 18:01:06 -05:00
Michael Yang	6deebf2489	update for qwen	2023-12-04 11:38:05 -08:00
Jeffrey Morgan	16a9006306	add back `f16c` instructions on intel mac	2023-11-26 15:59:49 -05:00
Jeffrey Morgan	9e4a316405	update submodule commit	2023-11-26 14:52:00 -05:00
Jing Zhang	82b9b329ff	windows CUDA support (#1262 ) * Support cuda build in Windows * Enable dynamic NumGPU allocation for Windows	2023-11-24 17:16:36 -05:00
Jongwook Choi	12e8c12d2b	Disable CUDA peer access as a workaround for multi-gpu inference bug (#1261 ) When CUDA peer access is enabled, multi-gpu inference will produce garbage output. This is a known bug of llama.cpp (or nvidia). Until the upstream bug is fixed, we can disable CUDA peer access temporarily to ensure correct output. See #961.	2023-11-24 14:05:57 -05:00
Jeffrey Morgan	d77dde126b	consistent cpu instructions on macos and linux	2023-11-22 16:26:46 -05:00
Michael Yang	199941cd15	fix: gguf int type	2023-11-22 11:40:30 -08:00
Michael Yang	a00fac4ec8	update llama.cpp	2023-11-21 09:50:02 -08:00
Jeffrey Morgan	a3fcecf943	only set `main_gpu` if value > 0 is provided	2023-11-20 19:54:04 -05:00
Michael Yang	19b7a4d715	recent llama.cpp update added kernels for fp32, q5_0, and q5_1	2023-11-20 13:44:31 -08:00
Purinda Gunasekara	be61a81758	main-gpu argument is not getting passed to llamacpp, fixed. (#1192 )	2023-11-20 10:52:52 -05:00
Jeffrey Morgan	13ba6df5ab	enable cpu instructions on intel macs	2023-11-19 23:20:26 -05:00
Jeffrey Morgan	36a3bbf65f	Update llm/llama.go	2023-11-18 21:25:07 -05:00
Bruce MacDonald	43a726149d	fix potentially inaccurate error message	2023-11-18 21:25:07 -05:00
Jeffrey Morgan	41434a7cdc	build intel mac with correct binary and compile flags	2023-11-16 22:14:51 -05:00
Jeffrey Morgan	5cba29b9d6	JSON mode: add `"format"` as an api parameter (#1051 ) * add `"format": "json"` as an API parameter --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2023-11-09 16:44:02 -08:00
Bruce MacDonald	1ae84bc2a2	skip gpu if less than 2GB VRAM are available (#1059 )	2023-11-09 13:16:16 -08:00
Michael Yang	c5e1bbabda	instead of static number of parameters for each model family, get the real number from the tensors (#1022 ) * parse tensor info * refactor decoder * return actual parameter count * explicit rounding * s/Human/HumanNumber/	2023-11-08 17:55:46 -08:00
Jeffrey Morgan	c44b619428	remove unused `fmt.Println`	2023-11-03 17:24:58 -07:00
Jeffrey Morgan	17678b7225	Restore system prompt on requests and default `num_keep` to `0`	2023-11-03 13:25:25 -07:00
Jeffrey Morgan	2e53704685	default rope params to 0 for new models (#968 )	2023-11-02 08:41:30 -07:00
Michael Yang	642128b75a	append LD_LIBRARY_PATH	2023-10-31 15:54:49 -07:00
Jeffrey Morgan	3a1ed9ff70	restore building runner with `AVX` on by default (#900 )	2023-10-27 12:13:44 -07:00
Bruce MacDonald	6d283882b1	catch insufficient permissions nvidia err (#934 )	2023-10-27 12:42:40 -04:00
Bruce MacDonald	2665f3c28e	offload 75% of available vram to improve stability (#921 )	2023-10-26 20:49:55 -04:00


250 commits