* feat: add explicit methods to free model
This commit introduces a `close` method to both `Llama` and `_LlamaModel`,
allowing users to explicitly free the model from RAM/VRAM.
The previous implementation relied on the destructor of `_LlamaModel` to free
the model. In Python, however, the timing of destructor calls is not
guaranteed; for instance, the `del` statement does not ensure that the
destructor is invoked immediately.
This commit provides an explicit method to release the model, which works
immediately and allows the user to load another model without memory issues.
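A minimal usage sketch (the model paths here are hypothetical):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # hypothetical path
llm("Q: What is 2+2? A:", max_tokens=8)
llm.close()  # frees the model from RAM/VRAM right away

# another model can now be loaded without running out of memory
llm = Llama(model_path="./another-model.gguf")
```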
Additionally, this commit implements a context manager in the `Llama` class,
enabling the automatic closure of the `Llama` object when used with the `with`
statement.
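For example, a sketch of the context-manager form (path hypothetical):

```python
from llama_cpp import Llama

with Llama(model_path="./model.gguf") as llm:  # hypothetical path
    print(llm("Q: Name a planet. A:", max_tokens=8))
# the model is closed automatically when the block exits
```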
* feat: Implement ContextManager in _LlamaModel, _LlamaContext, and _LlamaBatch
This commit enables automatic resource management by
implementing the `ContextManager` protocol in `_LlamaModel`,
`_LlamaContext`, and `_LlamaBatch`. This ensures that
resources are properly managed and released within a `with`
statement, enhancing robustness and safety in resource handling.
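A rough sketch of the pattern, using `_LlamaModel` as a stand-in for all
three classes (the actual free logic is elided):

```python
class _LlamaModel:
    def close(self) -> None:
        # release the underlying llama.cpp handle (details elided)
        ...

    def __enter__(self) -> "_LlamaModel":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # runs even if the with-body raised, so the handle is always freed
        self.close()
```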
* feat: add ExitStack for Llama's internal class closure
This update uses `contextlib.ExitStack` to manage and close the
internal classes in `Llama`, making resource management safer and
more efficient (see the sketch below).
* Use contextlib ExitStack and closing
* Explicitly free model when closing resources on server
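A condensed, self-contained sketch of the idea (the stub classes and
constructor arguments here are invented for illustration):

```python
import contextlib

class _LlamaModel:
    def __init__(self, model_path: str): ...
    def close(self) -> None: ...

class _LlamaContext:
    def __init__(self, model): ...
    def close(self) -> None: ...

class Llama:
    def __init__(self, model_path: str):
        self._stack = contextlib.ExitStack()
        # contextlib.closing() adapts anything with a close() method into
        # a context manager; ExitStack later closes them in reverse order
        self._model = self._stack.enter_context(
            contextlib.closing(_LlamaModel(model_path))
        )
        self._ctx = self._stack.enter_context(
            contextlib.closing(_LlamaContext(self._model))
        )

    def close(self) -> None:
        self._stack.close()  # context first, then model
```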
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* Support SPM infill
* typo--
* one fewer layer of parentheses necessary
* new required internals
* manually add bos/eos if model requires it
* add bos even when unknown
This is identical to llama.cpp's behaviour.
Presumably any model that doesn't use BOS is recent enough to have the `add_bos_token` metadata (see the BOS sketch after this list).
* don't add bos/eos on non-infill pre-tokenized prompt
* add tokenizer hack to remove leading space in suffix
* I keep forgetting metadata are strings
* check if bos exists
* add example
* add cls/sep instead of bos/eos for WPM vocab
* simplify
* color-code filtered suffix
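A rough sketch of the BOS decision described in this list (the `model`
accessors are illustrative; the metadata key follows GGUF convention):

```python
def should_add_bos(model) -> bool:
    # GGUF metadata values are strings, not booleans
    add_bos = model.metadata.get("tokenizer.ggml.add_bos_token")
    if add_bos is not None:
        return add_bos == "true"
    # no metadata: add BOS only if the model actually defines one
    return model.token_bos() != -1
```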
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* Templates sometimes have BOS in them, remove duplicate
* tokenize chat format prompts before completion
This is to ensure that we don't duplicate any special tokens (a sketch of the duplicate-BOS check follows this list).
Hopefully I amended the existing formats correctly?
* updated comment
* corrected a few
* add some missing internals
* proper bos/eos detection
* just let tokenizer do the job
* typo--
* align test with new response
* changed to a warning
* move to another PR
* Use python warnings module
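A sketch of the duplicate-BOS check referenced above (names are
illustrative):

```python
import warnings

def warn_on_duplicate_bos(prompt_tokens: list[int], bos_token_id: int) -> None:
    # templates sometimes embed BOS themselves; tokenizing the rendered
    # template with special tokens enabled can then produce two of them
    if prompt_tokens[:2] == [bos_token_id, bos_token_id]:
        warnings.warn("Duplicate leading BOS token detected in prompt.",
                      RuntimeWarning)
```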
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* Chat templates are rendered with `ImmutableSandboxedEnvironment` in transformers, so there is no need to do otherwise here.
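For reference, rendering through Jinja2's sandbox looks like this (the
template string and messages are placeholders):

```python
from jinja2.sandbox import ImmutableSandboxedEnvironment

env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
template = env.from_string(
    "{% for m in messages %}{{ m.role }}: {{ m.content }}\n{% endfor %}"
)
print(template.render(messages=[{"role": "user", "content": "hi"}]))
```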
Co-authored-by: Andrei <abetlen@gmail.com>
* Support multiple chat templates - step 1
As a first step, allow the user to select a template from metadata with the chat_format parameter in the form of `chat_template.name` (usage sketch below).
* register chat templates to self.chat_formats instead of globally
* Don't expose internal chat handlers yet
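A usage sketch (model path and template name are hypothetical):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",            # hypothetical path
    chat_format="chat_template.default",  # select the metadata template named "default"
)
```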
---------
Co-authored-by: Andrei <abetlen@gmail.com>
* Proper fill-in-middle support
Use prefix/middle/suffix tokens when metadata is present in the GGUF, e.g. in [this](https://huggingface.co/CISCai/CodeQwen1.5-7B-Chat-SOTA-GGUF) one (a token-layout sketch follows at the end of this entry).
* fall back to internal prefix/middle/suffix id
In some cases llama.cpp will make a guess at the FIM tokens; use them if there's no metadata.
* typo--
* don't insert special tokens that are not there in suffix
Note: `add_bos` is misnamed; it's actually `add_special`, and can cause several special tokens to be added to the token list (the `special` parameter is actually `parse_special`).
* don't add/parse any special tokens when using fim
I've left the original behavior in place when no FIM tokens are found, but this should perhaps be re-evaluated.
* don't append suffix to prompt_tokens unless fim tokens are detected
* make sure we only do this for fim
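A rough sketch of the resulting token layout (the `tokenize` signature
and token ids are stand-ins, not the library's API):

```python
def build_fim_tokens(tokenize, prefix_id, middle_id, suffix_id,
                     prompt: str, suffix: str) -> list[int]:
    # <PRE> prompt <SUF> suffix <MID> -- no other special tokens are
    # added or parsed inside the prompt/suffix text
    return (
        [prefix_id]
        + tokenize(prompt, add_special=False, parse_special=False)
        + [suffix_id]
        + tokenize(suffix, add_special=False, parse_special=False)
        + [middle_id]
    )
```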
---------
Co-authored-by: Andrei <abetlen@gmail.com>
* set up streaming for v2
* assert v2 streaming, fix tool_call vs function_call
* fix streaming with tool_choice/function_call
* make functions return only one function call when tool_choice is 'auto'
* fix
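To illustrate the tool_call vs function_call distinction in streamed
deltas (hand-written example data, not the library's code):

```python
# legacy streaming delta (deprecated OpenAI field)
function_call_delta = {
    "function_call": {"name": "get_weather", "arguments": "{\"city\""}
}

# v2 streaming delta: the same fragment wrapped in a tool_calls list
tool_call_delta = {
    "tool_calls": [{
        "index": 0,
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\""},
    }]
}
```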
---------
Co-authored-by: Andrei <abetlen@gmail.com>