baalajimaestro/llama.cpp

Author	SHA1	Message	Date
nullname	d634efcdd9	feat: adding `rpc_servers` parameter to `Llama` class (#1477 ) * passthru rpc_servers params wip * enable llama rpc by default * convert string to byte * add rpc package * Revert "enable llama rpc by default" This reverts commit 832c6dd56c979514cec5df224bf2d2014dccd790. * update readme * Only set rpc_servers when provided * Add rpc servers to server options --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-06-04 10:38:21 -04:00
Andrei Betlen	df45a4b3fe	fix: fix string value kv_overrides. Closes #1487	2024-05-29 02:02:22 -04:00
twaka	5212fb08ae	feat: add MinTokensLogitProcessor and min_tokens argument to server (#1333 ) * implement min_tokens * set default to 0 * pass min_tokens * fix * remove copy * implement MinTokensLogitsProcessor * format * fix condition	2024-05-14 09:50:53 -04:00
Andrei Betlen	0318702cdc	feat(server): Add support for setting root_path. Closes #1420	2024-05-05 12:49:31 -04:00
Andrei Betlen	f9b7221c8f	Merge branch 'main' of github.com:abetlen/llama_cpp_python into main	2024-05-03 19:07:54 -04:00
Andrei Betlen	0a454bebe6	feat(server): Remove temperature bounds checks for server. Closes #1384	2024-05-03 15:23:06 -04:00
Daniel Thuerck	2138561fab	fix(server): Propagate `flash_attn` to model load. (#1424 )	2024-05-03 12:17:07 -04:00
Andrei Betlen	31b1d95a6c	feat: Add llama-3-vision-alpha chat format	2024-05-02 11:32:18 -04:00
Andrei Betlen	22d77eefd2	feat: Add option to enable `flash_attn` to Lllama params and ModelSettings	2024-04-30 09:29:16 -04:00
Andrei	fe2da09538	feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) (#1147 ) * Test dummy image tags in chat templates * Format and improve types for llava_cpp.py * Add from_pretrained support to llava chat format. * Refactor llava chat format to use a jinja2 * Revert chat format test * Add moondream support (wip) * Update moondream chat format * Update moondream chat format * Update moondream prompt * Add function calling support * Cache last image embed * Add Llava1.6 support * Add nanollava support * Add obisidian support * Remove unnecessary import * Re-order multimodal chat formats * Logits all no longer required for multi-modal models * Update README.md * Update docs * Update README * Fix typo * Update README * Fix typo	2024-04-30 01:35:38 -04:00
Andrei Betlen	fcfea66857	fix: pydantic deprecation warning	2024-04-25 21:21:48 -04:00
Sean Bailey	53ebcc8bb5	feat(server): Provide ability to dynamically allocate all threads if desired using `-1` (#1364 )	2024-04-23 02:35:38 -04:00
khimaros	b73c73c0c6	feat: add `disable_ping_events` flag (#1257 ) for backward compatibility, this is false by default it can be set to true to disable EventSource pings which are not supported by some OpenAI clients. fixes https://github.com/abetlen/llama-cpp-python/issues/1256	2024-04-17 10:08:19 -04:00
ddh0	c96b2daebf	feat: Use all available CPUs for batch processing (#1345 )	2024-04-17 10:05:54 -04:00
Andrei Betlen	060bfa64d5	feat: Add support for yaml based configs	2024-04-10 02:47:01 -04:00
Limour	f165048a69	feat: add support for KV cache quantization options (#1307 ) * add KV cache quantization options https://github.com/abetlen/llama-cpp-python/discussions/1220 https://github.com/abetlen/llama-cpp-python/issues/1305 * Add ggml_type * Use ggml_type instead of string for quantization * Add server support --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-04-01 10:19:28 -04:00
windspirit95	aa9f1ae011	feat: Add logprobs support to chat completions (#1311 ) * Add logprobs return in ChatCompletionResponse * Fix duplicate field * Set default to false * Simplify check * Add server example --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-03-31 13:30:13 -04:00
Andrei Betlen	d11ccc3036	fix(server): minor type fixes	2024-03-23 17:14:15 -04:00
Andrei Betlen	f7decc9562	docs: Add chat examples to openapi ui	2024-03-19 10:52:53 -04:00
Felipe Lorenz	c139f8b5d5	feat: Add endpoints for tokenize, detokenize and count tokens (#1136 ) * Add endpoint to count tokens * Add tokenize and detokenize endpoints * Change response key to tokens for tokenize endpoint * Fix dependency bug * Cleanup * Remove example added by mistake * Move tokenize, detokenize, and count to Extras namespace. Tag existing endpoints --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-03-08 21:09:00 -05:00
Andrei Betlen	727d60c28a	misc: Format	2024-02-28 14:27:40 -05:00
Andrei Betlen	0d37ce52b1	feat: Update llama.cpp	2024-02-28 14:27:16 -05:00
Andrei	4d574bd765	feat(server): Add support for pulling models from Huggingface Hub (#1222 ) * Basic support for hf pull on server * Add hf_model_repo_id setting * Update README	2024-02-26 14:35:08 -05:00
Andrei Betlen	dcf38f6141	fix: remove prematurely commited change	2024-02-25 21:00:37 -05:00
Andrei Betlen	2292af5796	feat: Update llama.cpp	2024-02-25 16:53:58 -05:00
Andrei Betlen	fdce078cb9	feat: Update llama.cpp	2024-02-17 00:37:51 -05:00
khimaros	ea1f88dd29	fix: Use '\n' seperator for EventSourceResponse (#1188 ) this fixes compatibility with some OpenAI clients, including BetterChatGPT (https://github.com/ztjhz/BetterChatGPT/issues/537). Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-15 15:20:13 -05:00
Andrei Betlen	85d3374b4d	fix: broken import	2024-02-08 01:13:28 -05:00
Jeffrey Fong	901827013b	feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class (#1078 ) * convert functionary-v1 chat handler to use hf autotokenizer * add hf_tokenizer + inteegrate functionary-v1.4 prompt template * integrate functionary v2 prompt template * update readme * set up parallel function calling wip * set up parallel function calling * Update README.md * Update README.md * refactor tokenizers * include old functionary handler for backward compatibility * add hf_tokenizer_path in server ModelSettings * convert functionary-v1 chat handler to use hf autotokenizer * add hf_tokenizer + inteegrate functionary-v1.4 prompt template * integrate functionary v2 prompt template * update readme * set up parallel function calling wip * resolve merge conflict * Update README.md * Update README.md * refactor tokenizers * include old functionary handler for backward compatibility * add hf_tokenizer_path in server ModelSettings * Cleanup PR, fix breaking changes * Use hf_pretrained_model_name_or_path for tokenizer * fix hf tokenizer in streaming * update README * refactor offset mapping --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-07 20:07:03 -05:00
Andrei	fb762a6041	Add speculative decoding (#1120 ) * Add draft model param to llama class, implement basic prompt lookup decoding draft model * Use samplingcontext for sampling * Use 1d array * Use draft model for sampling * Fix dumb mistake * Allow for later extensions to the LlamaDraftModel api * Cleanup * Adaptive candidate prediction * Update implementation to match hf transformers * Tuning * Fix bug where last token was not used for ngram prediction * Remove heuristic for num_pred_tokens (no benefit) * fix: n_candidates bug. * Add draft_model_num_pred_tokens server setting * Cleanup * Update README	2024-01-31 14:08:14 -05:00
Andrei	da003d8768	Automatically set chat format from gguf (#1110 ) * Use jinja formatter to load chat format from gguf * Fix off-by-one error in metadata loader * Implement chat format auto-detection	2024-01-29 14:22:23 -05:00
Andrei Betlen	cde7514c3d	feat(server): include llama-cpp-python version in openapi spec	2024-01-25 11:23:18 -05:00
Andrei Betlen	24f39454e9	fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats.	2024-01-21 18:38:04 -05:00
Andrei Betlen	141293a75b	Fix python3.8 support	2024-01-19 08:17:49 -05:00
Andrei Betlen	b8fc1c7d83	feat: Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files.	2024-01-18 21:21:37 -05:00
Andrei Betlen	48c3b77e6f	Offload KQV by default	2024-01-18 11:08:57 -05:00
Kyle Mistele	9c36688b33	fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 (#1093 )	2024-01-16 18:54:06 -05:00
anil	cfb7da98ed	Support Accept text/event-stream in chat and completion endpoints, resolves #1083 (#1088 ) Co-authored-by: Anil Pathak <anil@heyday.com> Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-01-16 12:52:52 -05:00
Andrei Betlen	84615adbc6	Add split_mode option. Closes #1085	2024-01-15 12:49:20 -05:00
Phil H	76aafa6149	Implement GGUF metadata KV overrides (#1011 ) * Implement GGUF metadata overrides * whitespace fix * Fix kv overrides. * Fix pointer and pickle * Match llama.cpp kv_overrides cli argument --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-01-15 12:29:29 -05:00
Andrei Betlen	522aecb868	docs: add server config docs	2023-12-22 14:37:24 -05:00
swg	4b01a873ef	server: Support none defaulting to infinity for completions (#111 ) * Support defaulting to infinity or -1 for chat completions * Check if completion_tokens is none in error handler. * fix: max_tokens in create completion should match openai spec * Fix __call__ --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2023-12-22 14:05:13 -05:00
Dave	12b7f2f4e9	[Feat] Multi model support (#931 ) * Update Llama class to handle chat_format & caching * Add settings.py * Add util.py & update __main__.py * multimodel * update settings.py * cleanup * delete util.py * Fix /v1/models endpoint * MultiLlama now iterable, app check-alive on "/" * instant model init if file is given * backward compability * revert model param mandatory * fix error * handle individual model config json * refactor * revert chathandler/clip_model changes * handle chat_handler in MulitLlama() * split settings into server/llama * reduce global vars * Update LlamaProxy to handle config files * Add free method to LlamaProxy * update arg parsers & install server alias * refactor cache settings * change server executable name * better var name * whitespace * Revert "whitespace" This reverts commit bc5cf51c64a95bfc9926e1bc58166059711a1cd8. * remove exe_name * Fix merge bugs * Fix type annotations * Fix type annotations * Fix uvicorn app factory * Fix settings * Refactor server * Remove formatting fix * Format * Use default model if not found in model settings * Fix * Cleanup * Fix * Fix * Remove unnused CommandLineSettings * Cleanup * Support default name for copilot-codex models --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2023-12-22 05:51:25 -05:00
docmeth02	33cc623346	Implement openai api compatible authentication (#1010 )	2023-12-21 13:44:49 -05:00
Andrei Betlen	095c650006	Add offload_kqv option to llama and server	2023-12-18 15:36:09 -05:00
Brandon Roberts	62944df142	Bugfix: Remove f16_kv, add offload_kqv field (#1019 ) F16_KV appears to have been removed here: `af99c6fbfc` This addresses two issues: - #995 which just requests to add the KV cache offloading param - #1006 a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)	2023-12-18 14:27:11 -05:00
Radoslav Gerganov	8e44a32075	Add support for running the server with SSL (#994 )	2023-12-11 20:47:11 -05:00
Andrei Betlen	1a7bf2037b	docs: Update openapi endpoint names	2023-11-24 03:39:29 -05:00
Andrei Betlen	128dc4731f	Fix #569	2023-11-21 04:39:05 -05:00
Andrei Betlen	7a3f87846b	Format	2023-11-21 04:02:20 -05:00

1 2 3 4

168 commits