* feat: add explicit methods to free model
This commit introduces a `close` method to both `Llama` and `_LlamaModel`,
allowing users to explicitly free the model from RAM/VRAM.
The previous implementation relied on the destructor of `_LlamaModel` to free
the model. However, in Python the timing of destructor calls is not
guaranteed; for instance, the `del` statement only removes a reference and
does not ensure the destructor runs immediately.
This commit provides an explicit method to release the model, which frees
the memory immediately and allows the user to load another model without
exhausting RAM/VRAM.
Additionally, this commit implements a context manager in the `Llama` class,
enabling the automatic closure of the `Llama` object when used with the `with`
statement.
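A minimal usage sketch of what these commits enable (the model path is a placeholder):

```python
from llama_cpp import Llama

# Explicit release: frees the model from RAM/VRAM immediately,
# without waiting for the garbage collector.
llm = Llama(model_path="./model.gguf")  # placeholder path
llm.close()

# Context manager: the model is closed automatically on exit,
# even if an exception is raised inside the block.
with Llama(model_path="./model.gguf") as llm:
    output = llm("Q: Name the planets in the solar system. A: ", max_tokens=32)
```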
* feat: Implement ContextManager in _LlamaModel, _LlamaContext, and _LlamaBatch
This commit enables automatic resource management by
implementing the context manager protocol in `_LlamaModel`,
`_LlamaContext`, and `_LlamaBatch`. This ensures that their
native resources are released deterministically when used
within a `with` statement.
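A minimal sketch of the pattern (not the library's actual code; the native handle and free call are stand-ins):

```python
import ctypes

class _LlamaModel:
    """Sketch: owns a native handle and releases it deterministically."""

    def __init__(self) -> None:
        self.model = ctypes.c_void_p()  # stand-in for the llama.cpp model pointer

    def close(self) -> None:
        if self.model is not None:
            # In the real class this calls the llama.cpp free function.
            self.model = None

    def __enter__(self) -> "_LlamaModel":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.close()

# `with` guarantees close() runs, even if the body raises.
with _LlamaModel() as model:
    pass
```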
* feat: add ExitStack for Llama's internal class closure
This update uses `contextlib.ExitStack` to manage and close
`Llama`'s internal objects, ensuring they are released safely
and in reverse order of creation.
* Use contextlib ExitStack and closing
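A sketch of the pattern with a hypothetical stand-in resource; `contextlib.closing` adapts any object exposing `close()` into a context manager:

```python
import contextlib

class _Resource:
    """Hypothetical stand-in for _LlamaModel / _LlamaContext / _LlamaBatch."""
    def __init__(self, name: str) -> None:
        self.name = name
    def close(self) -> None:
        print(f"closing {self.name}")

class Llama:
    def __init__(self) -> None:
        self._stack = contextlib.ExitStack()
        # Register each internal object on the stack as it is created.
        self._model = self._stack.enter_context(contextlib.closing(_Resource("model")))
        self._ctx = self._stack.enter_context(contextlib.closing(_Resource("context")))
        self._batch = self._stack.enter_context(contextlib.closing(_Resource("batch")))

    def close(self) -> None:
        # Unwinds in reverse order: batch, then context, then model.
        self._stack.close()

Llama().close()
```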
* Explicitly free model when closing resources on server
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* Test dummy image tags in chat templates
* Format and improve types for llava_cpp.py
* Add from_pretrained support to llava chat format.
* Refactor llava chat format to use a jinja2 template
* Revert chat format test
* Add moondream support (wip)
* Update moondream chat format
* Update moondream chat format
* Update moondream prompt
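A usage sketch for the new multimodal handlers, following the project's `from_pretrained` conventions (the repo id, file patterns, and image URL are illustrative):

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

# Download the multimodal projector and text model from the Hugging Face Hub.
chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)
llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048,  # leave room for the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            ],
        }
    ]
)
```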
* Add function calling support
* Cache last image embed
* Add Llava1.6 support
* Add nanollava support
* Add Obsidian support
* Remove unnecessary import
* Re-order multimodal chat formats
* `logits_all` no longer required for multi-modal models
* Update README.md
* Update docs
* Update README
* Fix typo
* Update README
* Fix typo
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
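A usage sketch of the new parameter (module and class names as introduced here; the model path is a placeholder):

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Prompt lookup decoding drafts tokens by matching ngrams in the prompt,
# so no separate draft model weights are needed.
llm = Llama(
    model_path="./model.gguf",  # placeholder path
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)
```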
* Use sampling context for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the `LlamaDraftModel` API
* Cleanup
* Adaptive candidate prediction
* Update implementation to match Hugging Face Transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
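The idea these prompt-lookup commits iterate on, as a simplified standalone sketch (the library's real implementation differs in details): match the trailing ngram of the token sequence, including the very last token, against an earlier occurrence, and propose the tokens that followed it as the draft.

```python
import numpy as np

def find_candidate_pred_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Propose up to num_pred_tokens draft tokens by matching the trailing
    ngram (largest ngram first, including the last token) against an
    earlier occurrence in input_ids."""
    n = len(input_ids)
    for ngram_size in range(max_ngram_size, 0, -1):
        if n <= ngram_size:
            continue
        ngram = input_ids[n - ngram_size:]
        for start in range(n - ngram_size - 1, -1, -1):  # prefer recent matches
            if np.array_equal(input_ids[start:start + ngram_size], ngram):
                end = start + ngram_size
                return input_ids[end:end + num_pred_tokens]
    return np.empty(0, dtype=input_ids.dtype)

ids = np.array([1, 2, 3, 4, 5, 1, 2, 3])
print(find_candidate_pred_tokens(ids))  # [4 5 1 2 3]
```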
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README