Commit graph

54 commits

royjhan
b9f5e16c80
Introduce /api/embed endpoint supporting batch embedding (#5127)
* Initial Batch Embedding

* Revert "Initial Batch Embedding"

This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.

* Initial Draft

* mock up notes

* api/embed draft

* add server function

* check normalization

* clean up

* normalization

* playing around with truncate stuff

* Truncation

* Truncation

* move normalization to go

* Integration Test Template

* Truncation Integration Tests

* Clean up

* use float32

* move normalize

* move normalize test

* refactoring

* integration float32

* input handling and handler testing

* Refactoring of legacy and new

* clear comments

* merge conflicts

* touches

* embedding type 64

* merge conflicts

* fix hanging on single string

* refactoring

* test values

* set context length

* clean up

* testing clean up

* testing clean up

* remove function closure

* Revert "remove function closure"

This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.

* remove function closure

* remove redundant error check

* clean up

* more clean up

* clean up
2024-07-15 12:14:24 -07:00
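
For reference, the batch endpoint introduced in the commit above can be exercised with a small client like the sketch below. It assumes a local Ollama server on the default port and an embedding model such as all-minilm already pulled; the request and response field names follow the public /api/embed API shape, not code from this commit.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Assumed request/response shapes for the /api/embed endpoint.
type embedRequest struct {
	Model string   `json:"model"`
	Input []string `json:"input"` // batch of inputs to embed
}

type embedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
}

func main() {
	body, _ := json.Marshal(embedRequest{
		Model: "all-minilm", // assumed to be pulled already
		Input: []string{"why is the sky blue?", "why is grass green?"},
	})
	resp, err := http.Post("http://localhost:11434/api/embed", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("embeddings returned:", len(out.Embeddings))
}
```

The "fix hanging on single string" bullet suggests the endpoint also accepts a single string as input; the sketch sticks to the batch form.
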
Jeffrey Morgan
d8def1ff94
llm: allow gemma 2 to context shift (#5534) 2024-07-07 13:41:51 -04:00
Jeffrey Morgan
0e09c380fc
llm: print caching notices in debug only (#5533) 2024-07-07 12:38:04 -04:00
Jeffrey Morgan
2cc854f8cb
llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511)
* Revert "fix cmake build (#5505)"

This reverts commit 4fd5f3526a.

* llm: fix missing dylibs by restoring old build behavior

* crlf -> lf
2024-07-05 21:48:31 -04:00
Jeffrey Morgan
4fd5f3526a
fix cmake build (#5505) 2024-07-05 19:07:01 -04:00
Jeffrey Morgan
8f8e736b13
update llama.cpp submodule to d7fd29f (#5475) 2024-07-05 13:25:58 -04:00
Jeffrey Morgan
d89454de80
Use slot with cached prompt instead of least recently used (#5492)
* Use common prefix to select slot

* actually report `longest`
2024-07-05 12:32:47 -04:00
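
A rough sketch of the selection strategy named in the commit above: pick the slot whose cached prompt shares the longest common prefix with the incoming request, rather than the least recently used one. The slot representation and function names here are illustrative, not the project's actual code.

```go
package main

import "fmt"

// longestCommonPrefix returns how many leading tokens two sequences share.
func longestCommonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// pickSlot chooses the slot whose cached tokens share the longest common
// prefix with the incoming prompt, falling back to slot 0 when nothing matches.
func pickSlot(slots [][]int, prompt []int) (best, longest int) {
	for i, cached := range slots {
		if n := longestCommonPrefix(cached, prompt); n > longest {
			best, longest = i, n
		}
	}
	return best, longest
}

func main() {
	slots := [][]int{
		{1, 2, 3, 4},
		{1, 2, 9, 9, 9},
	}
	prompt := []int{1, 2, 3, 4, 5}
	slot, longest := pickSlot(slots, prompt)
	fmt.Printf("selected slot %d (shared prefix %d tokens)\n", slot, longest)
}
```
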
royjhan
3b5a4a77f3
Return Correct Prompt Eval Count Regardless of Cache Prompt (#5371)
* openai compatibility

* Revert "openai compatibility"

This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.

* remove erroneous subtraction of prompt cache
2024-07-03 13:46:23 -07:00
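
The fix above amounts to reporting the full prompt size rather than subtracting tokens already in the prompt cache. A tiny illustration of that distinction, with hypothetical names:

```go
package main

import "fmt"

// Before the fix, the reported count subtracted cached tokens; after it, the
// full prompt is counted regardless of cache hits. Both functions are
// illustrative only.
func promptEvalCountBuggy(promptTokens, cachedTokens int) int {
	return promptTokens - cachedTokens
}

func promptEvalCountFixed(promptTokens, cachedTokens int) int {
	_ = cachedTokens // cache hits no longer affect the reported count
	return promptTokens
}

func main() {
	fmt.Println(promptEvalCountBuggy(100, 80), promptEvalCountFixed(100, 80)) // 20 100
}
```
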
Jeffrey Morgan
717f7229eb
Do not shift context for sliding window models (#5368)
* Do not shift context for sliding window models

* truncate prompt > 2/3 tokens

* only target gemma2
2024-06-28 19:39:31 -07:00
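
One reading of the bullets above, sketched in code: sliding window models such as gemma2 cannot rely on context shifting, so an over-long prompt is truncated instead. The 2/3 figure is quoted from the bullet; the flag name and which end of the prompt is dropped are assumptions for illustration.

```go
package main

import "fmt"

// limitPrompt keeps the prompt within roughly two thirds of the context window
// for models that cannot shift their context (sliding window attention).
func limitPrompt(prompt []int, numCtx int, slidingWindow bool) []int {
	if !slidingWindow {
		return prompt // context shifting remains available
	}
	limit := numCtx * 2 / 3
	if len(prompt) <= limit {
		return prompt
	}
	return prompt[len(prompt)-limit:] // assumption: keep the most recent tokens
}

func main() {
	prompt := make([]int, 100)
	fmt.Println("kept tokens:", len(limitPrompt(prompt, 90, true)))
}
```
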
Michael Yang
9d91e5e587 remove confusing log message 2024-06-19 11:14:11 -07:00
Daniel Hiltgen
fb9cdfa723 Fix server.cpp for the new cuda build macros 2024-06-14 14:51:40 -07:00
Jeffrey Morgan
ead259d877
llm: fix seed value not being applied to requests (#4986) 2024-06-11 14:24:41 -07:00
Jeffrey Morgan
34f142797a
llm: always add bos token to prompt (#4941)
* fix embedding by adding fixes from llama.cpp upstream

* remove assert

---------

Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
2024-06-08 18:47:10 -07:00
Michael Yang
829ff87bd1
revert tokenize ffi (#4761)
* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65dbb.

* Revert "vocab only"

This reverts commit bf54c845e9.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a0410.
2024-05-31 18:54:21 -07:00
Michael Yang
de781b37c8 rm unused infill 2024-05-29 11:26:47 -07:00
Michael Yang
3e21799377 rm unused system prompt 2024-05-29 11:26:47 -07:00
Michael Yang
26a00a0410 use ffi for tokenizing/detokenizing 2024-05-29 11:26:47 -07:00
Michael Yang
714adb8bd1
bump (#4597) 2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires up the initial server-side portion
of progress reporting during load.
2024-05-23 13:36:48 -07:00
Sam
e15307fdf4
feat: add support for flash_attn (#4120)
* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
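
The commit above only states that flash_attn support was added when the hardware supports it. As a loose sketch of how such an option might be forwarded to the llama.cpp server subprocess, with the field name and flag assumed for illustration:

```go
package main

import "fmt"

// Options is a hypothetical subset of the runner options.
type Options struct {
	NumCtx         int
	FlashAttention bool
}

// buildRunnerArgs appends a flash attention flag only when the option is set.
func buildRunnerArgs(opts Options) []string {
	args := []string{"--ctx-size", fmt.Sprint(opts.NumCtx)}
	if opts.FlashAttention {
		args = append(args, "--flash-attn")
	}
	return args
}

func main() {
	fmt.Println(buildRunnerArgs(Options{NumCtx: 4096, FlashAttention: true}))
}
```
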
Michael Yang
58876091f7 log clean up 2024-05-09 14:55:36 -07:00
Daniel Hiltgen
920a4b0794 Merge remote-tracking branch 'upstream/main' into pr3702 2024-05-08 16:44:35 -07:00
Michael Yang
44869c59d6 omit prompt and generate settings from final response 2024-05-03 17:00:02 -07:00
jmorganca
fcf4d60eee llm: add back check for empty token cache 2024-04-30 17:38:44 -04:00
Jeffrey Morgan
18d9a7e1f1
update llama.cpp submodule to f364eb6 (#4060) 2024-04-30 17:25:39 -04:00
Daniel Hiltgen
23d23409a0
Update llama.cpp (#4036)
* Bump llama.cpp to b2761

* Adjust types for bump
2024-04-29 23:18:48 -04:00
ManniX-ITA
c942e4a07b
Fixed startup sequence to report model loading 2024-04-17 17:40:32 +02:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path (#3681)
* parse wide argv characters on windows

* cleanup

* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Daniel Hiltgen
0a0e9f3e0f Apply 01-cache.diff 2024-04-01 16:48:18 -07:00
Daniel Hiltgen
58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process, shut it down when idle, and
gracefully restart it if it has problems. This also serves as a first step toward
running multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
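
A minimal sketch of the subprocess model described in the commit above: llama.cpp runs as a child process so crashes and memory problems stay isolated from the main server, and the child can simply be restarted. The binary name, restart limit, and backoff are illustrative assumptions, not the project's actual supervisor.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// runRunner launches the runner binary and restarts it a few times on failure.
func runRunner(path string, args ...string) error {
	for attempt := 0; attempt < 3; attempt++ {
		cmd := exec.Command(path, args...)
		log.Printf("starting runner %s (attempt %d)", path, attempt+1)
		if err := cmd.Run(); err != nil {
			log.Printf("runner exited with error: %v", err)
			time.Sleep(time.Second) // simple backoff before restarting
			continue
		}
		return nil // clean exit, e.g. shut down after going idle
	}
	return fmt.Errorf("runner kept failing, giving up")
}

func main() {
	if err := runRunner("./llama-server", "--port", "0"); err != nil {
		log.Fatal(err)
	}
}
```
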
Jeffrey Morgan
f5ca7f8c8e
add license in file header for vendored llama.cpp code (#3351) 2024-03-26 16:23:23 -04:00
Daniel Hiltgen
43799532c1 Bump llama.cpp to b2474
The release just before the ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Jeffrey Morgan
e95ffc7448
llama: remove server static assets (#3174) 2024-03-15 19:24:12 -07:00
Daniel Hiltgen
85129d3a32 Adapt our build for imported server.cpp 2024-03-12 14:57:15 -07:00
Daniel Hiltgen
9ac6440da3 Import server.cpp as of b2356 2024-03-12 13:58:06 -07:00
racerole
53c107e20e
chore: fix typo (#3073)
Signed-off-by: racerole <jiangyifeng@outlook.com>
2024-03-12 14:09:22 -04:00
Bruce MacDonald
b80661e8c7
relay load model errors to the client (#3065) 2024-03-11 16:48:27 -04:00
Jeffrey Morgan
369eda65f5
update llama.cpp submodule to ceca1ae (#3064) 2024-03-11 12:57:48 -07:00
Jeffrey Morgan
1ffb1e2874
update llama.cpp submodule to 77d1ac7 (#3030) 2024-03-09 15:55:34 -08:00
Jeffrey Morgan
0e4669b04f
update llama.cpp submodule to 6cdabe6 (#2999) 2024-03-08 00:26:20 -08:00
Jeffrey Morgan
21347e1ed6
update llama.cpp submodule to c29af7e (#2868) 2024-03-01 15:26:04 -08:00
Jeffrey Morgan
4613a080e7
update llama.cpp submodule to 66c1968f7 (#2618) 2024-02-20 17:42:31 -05:00
Taras Tsugrii
01ff2e14db
[nit] Remove unused msg local var. (#2511) 2024-02-20 14:02:34 -05:00
Jeffrey Morgan
f7231ad9ad
set shutting_down to false once shutdown is complete (#2484) 2024-02-13 17:48:41 -08:00
Daniel Hiltgen
6680761596 Shutdown faster
Make sure that when a shutdown signal comes, we shut down quickly instead
of waiting for a potentially long exchange to wrap up.
2024-02-08 22:22:50 -08:00
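
A small sketch of the behavior described above, assuming standard Go signal handling rather than the project's actual code: a shutdown signal cancels a context so in-flight work stops promptly instead of running to completion.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Cancel the context as soon as a shutdown signal arrives.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	select {
	case <-ctx.Done():
		fmt.Println("shutdown signal received, exiting now")
	case <-time.After(30 * time.Second): // stand-in for a long exchange
		fmt.Println("exchange finished")
	}
}
```
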
Daniel Hiltgen
72b12c3be7 Bump llama.cpp to b1999
This requires an upstream change to support graceful termination,
carried as a patch.
2024-01-30 16:52:12 -08:00
Daniel Hiltgen
730dcfcc7a Refine debug logging for llm
This wires up logging in llama.cpp to always go to stderr, and also
turns up logging if OLLAMA_DEBUG is set.
2024-01-22 12:26:49 -08:00
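
A minimal sketch of the logging behavior described above, using Go's standard log/slog rather than the project's actual logger: output always goes to stderr, and verbosity is raised when OLLAMA_DEBUG is set.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to INFO; raise verbosity when OLLAMA_DEBUG is set.
	level := slog.LevelInfo
	if os.Getenv("OLLAMA_DEBUG") != "" {
		level = slog.LevelDebug
	}
	handler := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level})
	slog.SetDefault(slog.New(handler))

	slog.Debug("only visible when OLLAMA_DEBUG is set")
	slog.Info("server starting")
}
```
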
Daniel Hiltgen
ec3764538d Probe GPUs before backend init
Detect potential error scenarios so we can fall back to CPU mode without
hitting asserts.
2024-01-21 15:59:38 -08:00
Daniel Hiltgen
1b249748ab Add multiple CPU variants for Intel Mac
This also refines the build process for the ext_server build.
2024-01-17 15:08:54 -08:00
Jeffrey Morgan
557110d0ba
Disable mmap with lora layers (#1985) 2024-01-13 23:36:31 -05:00