* feat: add explicit methods to free model
This commit introduces a `close` method to both `Llama` and `_LlamaModel`,
allowing users to explicitly free the model from RAM/VRAM.
The previous implementation relied on the destructor of `_LlamaModel` to free
the model. However, in Python the timing of destructor calls is not guaranteed;
for instance, the `del` statement does not ensure the destructor runs
immediately.
This commit provides an explicit method to release the model, which frees the
memory immediately and lets the user load another model without memory issues.
Additionally, this commit implements a context manager in the `Llama` class,
enabling the automatic closure of the `Llama` object when used with the `with`
statement.
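A minimal usage sketch of the new API described above (the model path is a placeholder):

```python
from llama_cpp import Llama

# Explicitly free the model instead of waiting for the garbage collector.
llm = Llama(model_path="./models/7B/model.gguf")  # placeholder path
print(llm("Q: What is the capital of France? A:", max_tokens=16))
llm.close()  # releases RAM/VRAM immediately

# Or let the `with` statement close it automatically.
with Llama(model_path="./models/7B/model.gguf") as llm:
    print(llm("Q: Name the planets. A:", max_tokens=32))
# the model is freed here, so another model can be loaded without memory issues
```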
* feat: Implement ContextManager in _LlamaModel, _LlamaContext, and _LlamaBatch
This commit enables automatic resource management by
implementing the context manager protocol in `_LlamaModel`,
`_LlamaContext`, and `_LlamaBatch`. Each object now releases
its underlying resources when the enclosing `with` statement
exits, making resource handling more robust.
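A sketch of the pattern this implies for the internal classes; the bodies below are illustrative, not the actual implementation:

```python
class _LlamaModel:
    def close(self) -> None:
        # free the underlying llama.cpp handle exactly once (illustrative stub)
        ...

    def __enter__(self) -> "_LlamaModel":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # `with` blocks and explicit calls share a single cleanup path
        self.close()
```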
* feat: add ExitStack for Llama's internal class closure
This update uses `contextlib.ExitStack` to register and close
the internal objects held by `Llama`, so they are released
together in a single, deterministic cleanup step.
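A rough sketch of how `contextlib.ExitStack` (together with `contextlib.closing`, see the next commit) can gather those internal objects for one cleanup call; the constructor signatures here are assumptions, not the real internals:

```python
import contextlib

class Llama:
    def __init__(self, model_path: str, **kwargs):
        self._stack = contextlib.ExitStack()
        # everything registered on the stack is closed when the stack closes,
        # in reverse registration order (batch, then context, then model)
        self._model = self._stack.enter_context(
            contextlib.closing(_LlamaModel(model_path))  # assumed signature
        )
        self._ctx = self._stack.enter_context(
            contextlib.closing(_LlamaContext(self._model))  # assumed signature
        )
        self._batch = self._stack.enter_context(
            contextlib.closing(_LlamaBatch())  # assumed signature
        )

    def close(self) -> None:
        self._stack.close()
```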
* Use contextlib ExitStack and closing
* Explicitly free model when closing resources on server
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* Test dummy image tags in chat templates
* Format and improve types for llava_cpp.py
* Add from_pretrained support to llava chat format.
* Refactor llava chat format to use a jinja2 template
* Revert chat format test
* Add moondream support (wip)
* Update moondream chat format
* Update moondream chat format
* Update moondream prompt
* Add function calling support
* Cache last image embed
* Add Llava1.6 support
* Add nanollava support
* Add Obsidian support
* Remove unnecessary import
* Re-order multimodal chat formats
* `logits_all` no longer required for multi-modal models
* Update README.md
* Update docs
* Update README
* Fix typo
* Update README
* Fix typo
* Add logprobs return in ChatCompletionResponse
* Fix duplicate field
* Set default to false
* Simplify check
* Add server example
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
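For context, the multimodal chat format and `from_pretrained` commits above combine roughly like this; the repo IDs, filename globs, and handler class are illustrative and should be checked against the docs:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

# Download the mmproj (vision) model from the Hugging Face Hub (illustrative repo/filenames).
chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048,  # room for the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ]
)
```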
* Add endpoint to count tokens
* Add tokenize and detokenize endpoints
* Change response key to tokens for tokenize endpoint
* Fix dependency bug
* Cleanup
* Remove example added by mistake
* Move tokenize, detokenize, and count to Extras namespace. Tag existing endpoints
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
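A quick sketch of calling the new endpoints against a locally running server; the `/extras/...` paths and field names other than the `tokens` response key are my reading of the Extras namespace and should be verified against the server's OpenAPI docs:

```python
import requests

base = "http://localhost:8000"  # placeholder server address

# Tokenize a prompt; the response key is `tokens` per the commit above.
tokens = requests.post(f"{base}/extras/tokenize", json={"input": "Hello, world!"}).json()["tokens"]

# Count tokens without returning them.
count = requests.post(f"{base}/extras/tokenize/count", json={"input": "Hello, world!"}).json()["count"]

# Round-trip the tokens back into text.
text = requests.post(f"{base}/extras/detokenize", json={"tokens": tokens}).json()["text"]
```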
* Add draft model param to `Llama` class, implement basic prompt lookup decoding draft model
* Use sampling context for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
* Support defaulting to infinity or -1 for chat completions
* Check if completion_tokens is none in error handler.
* fix: max_tokens in create completion should match openai spec
* Fix `__call__`
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
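A hedged sketch of enabling the prompt-lookup draft model described above; the class and parameter names follow the project docs, but treat the exact values as assumptions:

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="./models/7B/model.gguf",  # placeholder path
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),  # tokens speculated per step
)

# Prompt lookup decoding helps most when the output repeats spans of the prompt,
# e.g. summarization, extraction, or code editing.
out = llm("Summarize the following text:\n...", max_tokens=128)
```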
F16_KV appears to have been removed here: af99c6fbfc
This addresses two issues:
- #995, which requests adding the KV cache offloading param
- #1006, a NULL pointer exception when using embeddings (introduced by
  leaving `f16_kv` in the fields struct)
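A minimal sketch of the KV cache offloading parameter requested in #995, assuming it is exposed as `offload_kqv` on the `Llama` constructor:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/model.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the GPU
    offload_kqv=True,  # assumed name: keep the KV cache on the GPU as well
)
```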