Commit graph

671 commits

Author SHA1 Message Date
Connor
a05d90446f
fix: Circular dependancy preventing early Llama object free (#1176)
commit 901827013b introduced a cyclic dependency
within Llama objects. That change causes old models to linger in memory longer
than necessary, thereby creating memory bloat in most applications attempting
to switch between models at runtime. This patch simply removes the problematic
line, allowing models to deallocate without relying on GC. One might also
consider combining `weakref.ref` with a `@property` if the `llama` attribute is
absolutely necessary to expose in the tokenizer class.
2024-02-11 13:57:57 -05:00
Andrei Betlen
4abb8c9386 Merge branch 'main' of github.com:abetlen/llama_cpp_python into main 2024-02-09 13:32:31 -05:00
Andrei Betlen
e16f06e6eb fix: revert _create_completions. 2024-02-09 02:02:13 -05:00
Andrei Betlen
85d3374b4d fix: broken import 2024-02-08 01:13:28 -05:00
Andrei Betlen
b5fca911b5 feat: Move tokenizer to own module 2024-02-08 01:08:18 -05:00
Jeffrey Fong
901827013b
feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class (#1078)
* convert functionary-v1 chat handler to use hf autotokenizer

* add hf_tokenizer + inteegrate functionary-v1.4 prompt template

* integrate functionary v2 prompt template

* update readme

* set up parallel function calling wip

* set up parallel function calling

* Update README.md

* Update README.md

* refactor tokenizers

* include old functionary handler for backward compatibility

* add hf_tokenizer_path in server ModelSettings

* convert functionary-v1 chat handler to use hf autotokenizer

* add hf_tokenizer + inteegrate functionary-v1.4 prompt template

* integrate functionary v2 prompt template

* update readme

* set up parallel function calling wip

* resolve merge conflict

* Update README.md

* Update README.md

* refactor tokenizers

* include old functionary handler for backward compatibility

* add hf_tokenizer_path in server ModelSettings

* Cleanup PR, fix breaking changes

* Use hf_pretrained_model_name_or_path for tokenizer

* fix hf tokenizer in streaming

* update README

* refactor offset mapping

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-07 20:07:03 -05:00
Andrei Betlen
34f31040f6 Bump version 2024-02-06 12:47:59 -05:00
Andrei Betlen
59760c85ed fix: Use llama_log_callback to avoid suppress_stdout_stderr 2024-02-05 21:52:12 -05:00
Andrei Betlen
3553b14670 Update llama.cpp 2024-02-05 13:26:50 -05:00
Andrei
7467f129e5
Revert "Fix: fileno error google colab (#729) (#1156)" (#1157)
This reverts commit bebfba0f08.
2024-02-02 12:18:55 -05:00
Dulsara
bebfba0f08
Fix: fileno error google colab (#729) (#1156)
Instead of using a devnull just made a dummy class with a 'write()' method that does nothing.
2024-02-02 12:05:46 -05:00
Andrei Betlen
3322eadbf3 Bump version 2024-01-31 15:10:18 -05:00
Andrei
fb762a6041
Add speculative decoding (#1120)
* Add draft model param to llama class, implement basic prompt lookup decoding draft model

* Use samplingcontext for sampling

* Use 1d array

* Use draft model for sampling

* Fix dumb mistake

* Allow for later extensions to the LlamaDraftModel api

* Cleanup

* Adaptive candidate prediction

* Update implementation to match hf transformers

* Tuning

* Fix bug where last token was not used for ngram prediction

* Remove heuristic for num_pred_tokens (no benefit)

* fix: n_candidates bug.

* Add draft_model_num_pred_tokens server setting

* Cleanup

* Update README
2024-01-31 14:08:14 -05:00
Andrei Betlen
71e3e4c435 Update llama.cpp 2024-01-31 10:41:42 -05:00
Andrei Betlen
078cca0361 fix: Pass raise_exception and add_generation_prompt to jinja2 chat template 2024-01-31 08:42:21 -05:00
Andrei Betlen
bf9e824922 Bump version 2024-01-30 12:27:27 -05:00
Andrei Betlen
011cd84ded Update llama.cpp 2024-01-30 09:48:09 -05:00
Andrei
da003d8768
Automatically set chat format from gguf (#1110)
* Use jinja formatter to load chat format from gguf

* Fix off-by-one error in metadata loader

* Implement chat format auto-detection
2024-01-29 14:22:23 -05:00
Andrei Betlen
464af5b39f Bump version 2024-01-29 10:46:04 -05:00
Andrei Betlen
9ae5819ee4 Add chat format test. 2024-01-29 00:59:01 -05:00
Rafaelblsilva
ce38dbdf07
Add mistral instruct chat format as "mistral-instruct" (#799)
* Added mistral instruct chat format as "mistral"

* Fix stop sequence (merge issue)

* Update chat format name to `mistral-instruct`

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-01-29 00:34:42 -05:00
Andrei Betlen
52c4a84faf Bump version 2024-01-28 19:35:37 -05:00
Andrei Betlen
c1d0fff8a9 Bump version 2024-01-27 18:36:56 -05:00
Andrei
d8f6914f45
Add json schema mode (#1122)
* Add json schema mode

* Add llava chat format support
2024-01-27 16:52:18 -05:00
Andrei Betlen
35918873b4 Update llama.cpp 2024-01-26 11:45:48 -05:00
Andrei Betlen
f5cc6b3053 Bump version 2024-01-25 11:28:16 -05:00
Andrei Betlen
cde7514c3d feat(server): include llama-cpp-python version in openapi spec 2024-01-25 11:23:18 -05:00
Andrei Betlen
c970d41a85 fix: llama_log_set should be able to accept null pointer 2024-01-24 10:38:30 -05:00
Andrei Betlen
9677a1f2c8 fix: Check order 2024-01-23 22:28:03 -05:00
Andrei Betlen
4d6b2f7b91 fix: format 2024-01-23 22:08:27 -05:00
Phil H
fe5d6ea648
fix: GGUF metadata KV overrides, re #1011 (#1116)
* kv overrides another attempt

* add sentinel element, simplify array population

* ensure sentinel element is zeroed
2024-01-23 22:00:38 -05:00
Andrei Betlen
fcdf337d84 Update llama.cpp 2024-01-22 11:25:11 -05:00
Andrei Betlen
5b982d0f8c fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. 2024-01-22 08:32:48 -05:00
Andrei Betlen
2ce0b8aa2c Bump version 2024-01-21 20:30:24 -05:00
Andrei Betlen
d3f5528ca8 fix: from_json_schema oneof/anyof bug. Closes #1097 2024-01-21 19:06:53 -05:00
Andrei Betlen
24f39454e9 fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats. 2024-01-21 18:38:04 -05:00
Andrei Betlen
7f3209b1eb feat: Add add_generation_prompt option for jinja2chatformatter. 2024-01-21 18:37:24 -05:00
Andrei Betlen
be09318c26 feat: Add Jinja2ChatFormatter 2024-01-19 15:04:42 -05:00
Andrei Betlen
5a34c57e54 feat: Expose gguf model metadata in metadata property 2024-01-19 10:46:03 -05:00
Andrei Betlen
833a7f1a86 Bump version 2024-01-19 09:03:35 -05:00
Andrei Betlen
3babe3512c Fix mirostat sampling 2024-01-19 08:31:59 -05:00
Andrei Betlen
141293a75b Fix python3.8 support 2024-01-19 08:17:49 -05:00
Andrei Betlen
656f3d8968 Bump version 2024-01-18 21:30:36 -05:00
Andrei Betlen
89cce50f8c Update llama.cpp 2024-01-18 21:21:49 -05:00
Andrei Betlen
b8fc1c7d83 feat: Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files. 2024-01-18 21:21:37 -05:00
Andrei Betlen
48c3b77e6f Offload KQV by default 2024-01-18 11:08:57 -05:00
Austin
6bfe98bd80
Integration of Jinja2 Templating (#875)
* feat: Add support for jinja templating

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* fix: Refactor chat formatter and update interface for jinja templates

- Simplify the `llama2_template` in `llama_jinja_format.py` by removing unnecessary line breaks for readability without affecting functionality.
- Update `ChatFormatterInterface` constructor to accept a more generic `Optional[object]` type for the template parameter, enhancing flexibility.
- Introduce a `template` property to `ChatFormatterInterface` for standardized access to the template string.
- Replace `MetaSingleton` metaclass with `Singleton` for the `ChatFormatterFactory` to streamline the singleton implementation.

These changes enhance code readability, maintain usability, and ensure consistency in the chat formatter's design pattern usage.

* Add outline for Jinja2 templating integration documentation

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Add jinja2 as a dependency with version range for Hugging Face transformers compatibility

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Update jinja2 version constraint for mkdocs-material compatibility

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Fix attribute name in AutoChatFormatter

- Changed attribute name from `self._renderer` to `self._environment`

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
2024-01-17 09:47:52 -05:00
Andrei Betlen
7b46bb5a78 Re-order classes in llama.py 2024-01-17 09:16:13 -05:00
Andrei Betlen
cc4630e66f Move helper classes to _internals submodule 2024-01-17 09:14:00 -05:00
Andrei Betlen
3b92419132 Move cache classes to llama_cache submodule. 2024-01-17 09:09:12 -05:00
Kyle Mistele
9c36688b33
fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 (#1093) 2024-01-16 18:54:06 -05:00
anil
cfb7da98ed
Support Accept text/event-stream in chat and completion endpoints, resolves #1083 (#1088)
Co-authored-by: Anil Pathak <anil@heyday.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-01-16 12:52:52 -05:00
Andrei Betlen
4b11fa83c0 Bump version 2024-01-15 12:54:51 -05:00
Andrei Betlen
84615adbc6 Add split_mode option. Closes #1085 2024-01-15 12:49:20 -05:00
Phil H
76aafa6149
Implement GGUF metadata KV overrides (#1011)
* Implement GGUF metadata overrides

* whitespace fix

* Fix kv overrides.

* Fix pointer and pickle

* Match llama.cpp kv_overrides cli argument

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-01-15 12:29:29 -05:00
yieldthought
7eff42c239
Avoid "LookupError: unknown encoding: ascii" when open() called in a destructor (#1012)
The existing code often causes "LookupError: unknown encoding: ascii" when open() called in a destructor. Saving open in self.open is not enough to avoid this. Instead, we can avoid reopening /dev/null every time by doing it once when the module is loaded.
2024-01-15 10:52:10 -05:00
Mark Neumann
c689ccc728
Fix Pydantic model parsing (#1087) 2024-01-15 10:45:57 -05:00
Andrei Betlen
5502ac8876 Update llama.cpp 2024-01-15 10:12:10 -05:00
Andrei Betlen
359ae73643 Update llama.cpp 2024-01-14 08:17:22 -05:00
Andrei Betlen
7c898d5684 Update llama.cpp 2024-01-13 22:37:49 -05:00
Andrei Betlen
bb610b9428 Update llama.cpp 2024-01-11 22:51:12 -05:00
Andrei Betlen
f0159663d9 Bump version 2024-01-10 02:51:17 -05:00
Stephen Hankinson
df3be58d6c
Add ability to pass in penalize_nl param (#1068) 2024-01-10 02:46:27 -05:00
Joseph Turian
2ddce7294e
print_grammar to stderr (#1052) 2024-01-10 02:46:03 -05:00
Andrei Betlen
1ae05c102b Update llama.cpp 2024-01-08 14:51:29 -05:00
Andrei Betlen
75d0527fd7 Bump version 2024-01-04 18:30:12 -05:00
Fedor Moiseev
907b9e9d42
Add Saiga chat format. (#1050) 2024-01-04 18:12:58 -05:00
xaviviro
cf743ec5d3
Added ChatGLM chat format (#1059)
Co-authored-by: Xavier Vinaixa Rosello <xaviviro@MacBook-Pro-de-Xavier.local>
2024-01-04 18:12:02 -05:00
Andrei Betlen
eb9c7d4ed8 Update llama.cpp 2024-01-03 22:04:04 -05:00
Andrei Betlen
011c3630f5 Bump version 2023-12-27 17:35:02 -05:00
Andrei Betlen
92284f32cb Add HIP_PATH to dll search directories for windows users. 2023-12-22 15:29:56 -05:00
Andrei Betlen
2b0d3f36fa set llama_max_devices using library function 2023-12-22 15:19:28 -05:00
Andrei Betlen
d9a1d90fd7 Fix typo 2023-12-22 15:12:27 -05:00
Andrei Betlen
37556bf9c4 Bump version 2023-12-22 14:55:58 -05:00
Andrei Betlen
6d8bc090f9 fix: inccorect bindings for kv override. Based on #1011 2023-12-22 14:52:20 -05:00
Andrei Betlen
522aecb868 docs: add server config docs 2023-12-22 14:37:24 -05:00
Andrei Betlen
6473796343 Update llama.cpp 2023-12-22 14:10:34 -05:00
swg
4b01a873ef
server: Support none defaulting to infinity for completions (#111)
* Support defaulting to infinity or -1 for chat completions

* Check if completion_tokens is none in error handler.

* fix: max_tokens in create completion should match openai spec

* Fix __call__

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2023-12-22 14:05:13 -05:00
Dave
12b7f2f4e9
[Feat] Multi model support (#931)
* Update Llama class to handle chat_format & caching

* Add settings.py

* Add util.py & update __main__.py

* multimodel

* update settings.py

* cleanup

* delete util.py

* Fix /v1/models endpoint

* MultiLlama now iterable, app check-alive on "/"

* instant model init if file is given

* backward compability

* revert model param mandatory

* fix error

* handle individual model config json

* refactor

* revert chathandler/clip_model changes

* handle chat_handler in MulitLlama()

* split settings into server/llama

* reduce global vars

* Update LlamaProxy to handle config files

* Add free method to LlamaProxy

* update arg parsers & install server alias

* refactor cache settings

* change server executable name

* better var name

* whitespace

* Revert "whitespace"

This reverts commit bc5cf51c64a95bfc9926e1bc58166059711a1cd8.

* remove exe_name

* Fix merge bugs

* Fix type annotations

* Fix type annotations

* Fix uvicorn app factory

* Fix settings

* Refactor server

* Remove formatting fix

* Format

* Use default model if not found in model settings

* Fix

* Cleanup

* Fix

* Fix

* Remove unnused CommandLineSettings

* Cleanup

* Support default name for copilot-codex models

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2023-12-22 05:51:25 -05:00
Andrei Betlen
4a85442c35 Update llama.cpp 2023-12-22 00:12:37 -05:00
twaka
2f03fb0231
fix text_offset of multi-token characters (#1037)
* fix text_offsets for bytes tokens

* fix
2023-12-22 00:03:29 -05:00
docmeth02
33cc623346
Implement openai api compatible authentication (#1010) 2023-12-21 13:44:49 -05:00
Andrei Betlen
a05b4da80a fix: float32 is not JSON serializable when streaming logits. 2023-12-18 18:40:36 -05:00
Andrei Betlen
7df6c32544 Fix type annotations 2023-12-18 18:14:53 -05:00
Andrei Betlen
b703aad79e Fix type annotation 2023-12-18 18:13:37 -05:00
Andrei Betlen
d0aedfcff6 Fix type annotation 2023-12-18 18:12:49 -05:00
Eduard Christian Dumitrescu
2993936b10
Fix ctypes definitions of llama_kv_cache_view_update and llama_kv_cache_view_free. (#1028) 2023-12-18 18:11:26 -05:00
Andrei Betlen
5e863d8a3b Bump version 2023-12-18 16:09:18 -05:00
Andrei Betlen
095c650006 Add offload_kqv option to llama and server 2023-12-18 15:36:09 -05:00
Andrei Betlen
472b344ae3 Remove unnused import 2023-12-18 15:32:40 -05:00
kddubey
6b2e0e05b4
perf: Don't convert logprobs arrays to lists (#1021) 2023-12-18 14:28:12 -05:00
Brandon Roberts
62944df142
Bugfix: Remove f16_kv, add offload_kqv field (#1019)
F16_KV appears to have been removed here: af99c6fbfc

This addresses two issues:

 - #995 which just requests to add the KV cache offloading param
 - #1006 a NULL ptr exception when using the embeddings (introduced by
   leaving f16_kv in the fields struct)
2023-12-18 14:27:11 -05:00
Daniele Morotti
f1c631dc53
Bug fixed with n_ctx=0 (#1015)
If the n_ctx is set to 0 the code should use the maximum context length of the selected model, but it didn't work. There was a problem with the initialization of this parameter and a related problem with 'n_batch'.
2023-12-16 18:59:50 -05:00
kddubey
5a8944672f
Fix logits_to_logprobs for 2-D and 3-D logits (#1002)
* Fix logits_to_logprobs for 2-D and 3-D logits

* Set dtype to single

* Test size
2023-12-16 18:59:26 -05:00
Andrei Betlen
534b1ea9b5 Update llama.cpp 2023-12-16 18:57:43 -05:00
Andrei Betlen
cbce061ffd Bump version 2023-12-13 21:52:29 -05:00
yhfgyyf
8b4db732bd
Add qwen chat format (#1005) 2023-12-13 21:43:43 -05:00
Andrei Betlen
690c563b60 Merge branch 'main' of github.com:abetlen/llama_cpp_python into main 2023-12-13 21:43:19 -05:00
Andrei Betlen
c0fc0a1e82 Update llama.cpp 2023-12-13 21:43:16 -05:00
Radoslav Gerganov
8e44a32075
Add support for running the server with SSL (#994) 2023-12-11 20:47:11 -05:00