Commit graph

757 commits

Author SHA1 Message Date
Andrei Betlen
20e6815252 fix: json mode 2024-03-15 12:58:34 -04:00
Andrei Betlen
4084aabe86 fix: set default pooling type to unspecified 2024-03-14 10:04:57 -04:00
Andrei Betlen
d318cc8b83 fix: Set default pooling_type to mean, check for null pointer. 2024-03-14 09:17:41 -04:00
Andrei Betlen
dd0ee56217 feat: Update llama.cpp 2024-03-13 15:57:35 -04:00
Andrei Betlen
08e910f7a7 feat: Update llama.cpp 2024-03-10 23:45:05 -04:00
Andrei Betlen
a7281994d8 chore: Bump version 2024-03-08 21:14:44 -05:00
Andrei Betlen
919fca9f2b Merge branch 'main' of https://github.com/abetlen/llama-cpp-python into main 2024-03-08 21:10:56 -05:00
Andrei Betlen
d02a9cf16f Fixed json strings grammar by blacklisting character control set. Closes #1259 2024-03-08 21:10:53 -05:00
Felipe Lorenz
c139f8b5d5
feat: Add endpoints for tokenize, detokenize and count tokens (#1136)
* Add endpoint to count tokens

* Add tokenize and detokenize endpoints

* Change response key to tokens for tokenize endpoint

* Fix dependency bug

* Cleanup

* Remove example added by mistake

* Move tokenize, detokenize, and count to Extras namespace. Tag existing endpoints

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-03-08 21:09:00 -05:00
Kevin Cao
1f3156d4f2
fix: Check for existence of clip model path (#1264) 2024-03-08 21:00:10 -05:00
Douglas Hanley
2811014bae
feat: Switch embed to llama_get_embeddings_seq (#1263)
* switch to llama_get_embeddings_seq

* Remove duplicate definition of llama_get_embeddings_seq

Co-authored-by: Andrei <abetlen@gmail.com>

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-03-08 20:59:35 -05:00
Andrei Betlen
40c6b54f68 feat: Update llama.cpp 2024-03-08 20:58:50 -05:00
Andrei Betlen
93dc56ace8 Update llama.cpp 2024-03-06 01:32:00 -05:00
Andrei Betlen
87a6e5797e feat: Update llama.cpp 2024-03-03 11:27:04 -05:00
Andrei Betlen
13177aae0f chore: Bump version 2024-03-02 22:46:40 -05:00
Andrei Betlen
0e70984fb6 feat: Update llama.cpp 2024-03-02 22:20:04 -05:00
Andrei Betlen
d5df431278 chore: Bump version 2024-03-01 13:15:16 -05:00
Andrei Betlen
97aa3a153d docs: Add information re: auto chat formats. Closes #1236 2024-03-01 13:10:25 -05:00
Andrei Betlen
f062a7f51d feat: Update llama.cpp 2024-03-01 12:57:16 -05:00
Andrei Betlen
8c71725d53 fix: Remove deprecated cfg sampling functions 2024-02-28 14:37:07 -05:00
Andrei Betlen
727d60c28a misc: Format 2024-02-28 14:27:40 -05:00
Andrei Betlen
0d37ce52b1 feat: Update llama.cpp 2024-02-28 14:27:16 -05:00
Andrei Betlen
ffcd4b2636 chore: Bump version 2024-02-28 01:38:32 -05:00
Sigbjørn Skjæret
c36ab15e68
fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter (#1230)
The token strings were not correctly retrieved (empty).
2024-02-28 01:30:31 -05:00
Andrei Betlen
fea33c9b94 feat: Update llama.cpp 2024-02-27 12:22:17 -05:00
Andrei
4d574bd765
feat(server): Add support for pulling models from Huggingface Hub (#1222)
* Basic support for hf pull on server

* Add hf_model_repo_id setting

* Update README
2024-02-26 14:35:08 -05:00
Andrei Betlen
afe1e445c9 chore: Bump version 2024-02-26 11:43:24 -05:00
Andrei Betlen
9558ce7878 feat: Update llama.cpp 2024-02-26 11:40:58 -05:00
Andrei Betlen
dbaba3059d fix: positional arguments only for low-level api 2024-02-26 11:31:11 -05:00
Andrei Betlen
78e536dcfe fix: typo 2024-02-26 11:14:26 -05:00
Andrei Betlen
44558cbd7a misc: llava_cpp use ctypes function decorator for binding 2024-02-26 11:07:33 -05:00
Andrei Betlen
8383a9e562 fix: llava this function takes at least 4 arguments (0 given) 2024-02-26 11:03:20 -05:00
Andrei Betlen
8e03fd9957 chore: Bump version 2024-02-25 21:15:42 -05:00
Andrei Betlen
dcf38f6141 fix: remove prematurely commited change 2024-02-25 21:00:37 -05:00
Andrei Betlen
cbbcd888af feat: Update llama.cpp 2024-02-25 20:52:14 -05:00
Andrei Betlen
19234aa0db fix: Restore type hints for low-level api 2024-02-25 16:54:37 -05:00
Andrei Betlen
2292af5796 feat: Update llama.cpp 2024-02-25 16:53:58 -05:00
Andrei Betlen
221edb9ef1 feat: Update llama.cpp 2024-02-24 23:47:29 -05:00
Andrei Betlen
20ea6fd7d6 chore: Bump version 2024-02-23 12:38:36 -05:00
Andrei Betlen
47bad30dd7 fix: LlamaHFTokenizer now receives pre_tokens 2024-02-23 12:23:24 -05:00
Andrei Betlen
ded5d627a5 chore: Bump version 2024-02-23 11:32:43 -05:00
Luke Stanley
858496224e
feat: Auto detect Mixtral's slightly different format (#1214) 2024-02-23 11:27:38 -05:00
Andrei Betlen
db776a885c fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8' 2024-02-23 11:24:53 -05:00
Andrei Betlen
427d816ebf chore: Bump version 2024-02-23 04:54:08 -05:00
Alvaro Bartolome
251a8a2cad
feat: Add Google's Gemma formatting via chat_format="gemma" (#1210)
* Add Google's Gemma formatting via `chat_format="gemma"`

* Replace `raise ValueError` with `logger.debug`

Co-authored-by: Andrei <abetlen@gmail.com>

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-23 04:40:52 -05:00
Andrei Betlen
b9aca612af misc: use typesafe byref for internal classes 2024-02-23 03:40:07 -05:00
Andrei Betlen
a0ce429dc0 misc: use decorator to bind low level api functions, fixes docs 2024-02-23 03:39:38 -05:00
Andrei Betlen
e10af30cf1 fix: TypeAlias import error 2024-02-22 03:27:28 -05:00
Andrei Betlen
3561ebf536 Merge branch 'main' of https://github.com/abetlen/llama-cpp-python into main 2024-02-22 03:25:13 -05:00
Andrei Betlen
aefcb8f71a misc: additional type annotations for low level api 2024-02-22 02:00:09 -05:00
Andrei Betlen
3921e10770 feat: support minItems/maxItems in JSON grammar converter (by @nopperl) 2024-02-22 00:17:06 -05:00
Andrei Betlen
e6d6260a91 fix: Update from_pretrained defaults to match hf_hub_download 2024-02-22 00:10:23 -05:00
Andrei Betlen
dd22010e85 fix: Raise exceptions when llama model or context fails to load 2024-02-22 00:09:45 -05:00
Andrei Betlen
3632241e98 chore: Bump version 2024-02-21 23:09:13 -05:00
Andrei Betlen
0653e15c20 feat: Update llama.cpp 2024-02-21 23:04:52 -05:00
Andrei Betlen
7981e9ce1e chore: Bump version 2024-02-21 16:30:59 -05:00
Andrei
7f51b6071f
feat(low-level-api): Improve API static type-safety and performance (#1205) 2024-02-21 16:25:38 -05:00
Andrei
0f8aa4ab5c
feat: Pull models directly from huggingface (#1206)
* Add from_pretrained method to Llama class

* Update docs

* Merge filename and pattern
2024-02-21 16:25:10 -05:00
Andrei Betlen
e42f62c247 chore: Bump version 2024-02-21 11:09:40 -05:00
Andrei Betlen
4edde21b3d feat: Update llama.cpp 2024-02-21 11:05:58 -05:00
Andrei Betlen
6225f027e5 feat: Update llama.cpp 2024-02-19 04:11:34 -05:00
Andrei Betlen
748c0ce057 feat: Update llama.cpp 2024-02-18 21:30:36 -05:00
Andrei Betlen
53f6f5f415 fix: self.numa missing 2024-02-17 01:02:33 -05:00
Andrei Betlen
fdce078cb9 feat: Update llama.cpp 2024-02-17 00:37:51 -05:00
Andrei Betlen
f736827b9b chore: Bump version 2024-02-15 23:10:50 -05:00
Andrei Betlen
0ce66bc080 fix: create_embedding broken response for input type str 2024-02-15 16:09:48 -05:00
khimaros
ea1f88dd29
fix: Use '\n' seperator for EventSourceResponse (#1188)
this fixes compatibility with some OpenAI clients, including BetterChatGPT (https://github.com/ztjhz/BetterChatGPT/issues/537).

Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-15 15:20:13 -05:00
Andrei Betlen
a5cfeb7763 feat: Update llama.cpp 2024-02-15 15:17:30 -05:00
Douglas Hanley
7bb91f025f
fix: Incorporate embedding pooling layer fixes (#1194)
* remove division by token count

* truncate to n_batch, not n_ctx
2024-02-15 15:16:30 -05:00
Andrei Betlen
ae71ad1a14 Bump version 2024-02-14 04:31:42 -05:00
Douglas Hanley
d7a67917ba
feat: Support batch embeddings (#1186)
* handle batched embeddings

* fix normalization issue

* fix type hints, ensure no breaking changes to embed

* Clear kv cache / reset internal state after embedding complete

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-14 04:26:09 -05:00
Andrei Betlen
7b9960d1cb Update llama.cpp 2024-02-14 03:47:21 -05:00
Andrei Betlen
6943bab6d8 fix: destructor exception where internal classes are missing some uninitialized attributes 2024-02-14 03:38:41 -05:00
Andrei Betlen
07a783779a fix: Update openbuddy prompt format. Closes #1155 2024-02-13 23:57:10 -05:00
Andrei Betlen
345215a76c fix: more chatml-function-calling fixes 2024-02-13 23:02:50 -05:00
Andrei Betlen
b1637c2319 Bump version 2024-02-13 12:35:04 -05:00
Andrew Lapp
d6be5333e1
fix: sample idx off-by-one error for logit_processors (#1179)
* fix sample_idx off-by-one error

* self._scores is indexed differently, only modify the index within self._input_ids

---------

Co-authored-by: Andrew Lapp <andrew@rew.la>
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-13 12:26:07 -05:00
Andrei Betlen
f7cdf78788 Update llama.cpp 2024-02-13 12:24:00 -05:00
Andrei Betlen
68fb71b6a2 fix: missing generation_prompt in chatml-function-calling 2024-02-13 03:24:41 -05:00
Andrei Betlen
4b0e3320bd fix: minor formatting bugs for chatml-function-calling 2024-02-13 03:11:35 -05:00
Andrei Betlen
6fe8b427e1 Bump version 2024-02-13 02:46:52 -05:00
Andrei Betlen
d1822fed6b fix: Don't change order of json schema object properties unless prop_order is passed, Closes #1180 2024-02-13 02:44:00 -05:00
Andrei Betlen
d605875772 Bump version 2024-02-12 16:28:30 -05:00
Andrei Betlen
cb791716b4 fix: Always set logits_all = True when using speculative decoding 2024-02-12 16:19:05 -05:00
Andrei
153a0049d9
feat: Generic chatml Function Calling (#957)
* Add demo notebook

* Add initial chat handler

* Update OpenAI types

* Add generic chatml function calling (wip)

* Update chatml generic function calling.

* Progress on auto-tool calls

* fix streaming functions

* Remove print statements

* fix: Suppress output from llama.cpp init and grammar creation

* Add OpenAI v1 python api compatible chat completion function

* Support non-streaming multi-tool calls

* Format

* Include function_call in response.
2024-02-12 15:56:07 -05:00
Andrei Betlen
69413ce08e Update llama.cpp 2024-02-11 19:00:17 -05:00
Connor
a05d90446f
fix: Circular dependancy preventing early Llama object free (#1176)
commit 901827013b introduced a cyclic dependency
within Llama objects. That change causes old models to linger in memory longer
than necessary, thereby creating memory bloat in most applications attempting
to switch between models at runtime. This patch simply removes the problematic
line, allowing models to deallocate without relying on GC. One might also
consider combining `weakref.ref` with a `@property` if the `llama` attribute is
absolutely necessary to expose in the tokenizer class.
2024-02-11 13:57:57 -05:00
Andrei Betlen
4abb8c9386 Merge branch 'main' of github.com:abetlen/llama_cpp_python into main 2024-02-09 13:32:31 -05:00
Andrei Betlen
e16f06e6eb fix: revert _create_completions. 2024-02-09 02:02:13 -05:00
Andrei Betlen
85d3374b4d fix: broken import 2024-02-08 01:13:28 -05:00
Andrei Betlen
b5fca911b5 feat: Move tokenizer to own module 2024-02-08 01:08:18 -05:00
Jeffrey Fong
901827013b
feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class (#1078)
* convert functionary-v1 chat handler to use hf autotokenizer

* add hf_tokenizer + inteegrate functionary-v1.4 prompt template

* integrate functionary v2 prompt template

* update readme

* set up parallel function calling wip

* set up parallel function calling

* Update README.md

* Update README.md

* refactor tokenizers

* include old functionary handler for backward compatibility

* add hf_tokenizer_path in server ModelSettings

* convert functionary-v1 chat handler to use hf autotokenizer

* add hf_tokenizer + inteegrate functionary-v1.4 prompt template

* integrate functionary v2 prompt template

* update readme

* set up parallel function calling wip

* resolve merge conflict

* Update README.md

* Update README.md

* refactor tokenizers

* include old functionary handler for backward compatibility

* add hf_tokenizer_path in server ModelSettings

* Cleanup PR, fix breaking changes

* Use hf_pretrained_model_name_or_path for tokenizer

* fix hf tokenizer in streaming

* update README

* refactor offset mapping

---------

Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-07 20:07:03 -05:00
Andrei Betlen
34f31040f6 Bump version 2024-02-06 12:47:59 -05:00
Andrei Betlen
59760c85ed fix: Use llama_log_callback to avoid suppress_stdout_stderr 2024-02-05 21:52:12 -05:00
Andrei Betlen
3553b14670 Update llama.cpp 2024-02-05 13:26:50 -05:00
Andrei
7467f129e5
Revert "Fix: fileno error google colab (#729) (#1156)" (#1157)
This reverts commit bebfba0f08.
2024-02-02 12:18:55 -05:00
Dulsara
bebfba0f08
Fix: fileno error google colab (#729) (#1156)
Instead of using a devnull just made a dummy class with a 'write()' method that does nothing.
2024-02-02 12:05:46 -05:00
Andrei Betlen
3322eadbf3 Bump version 2024-01-31 15:10:18 -05:00
Andrei
fb762a6041
Add speculative decoding (#1120)
* Add draft model param to llama class, implement basic prompt lookup decoding draft model

* Use samplingcontext for sampling

* Use 1d array

* Use draft model for sampling

* Fix dumb mistake

* Allow for later extensions to the LlamaDraftModel api

* Cleanup

* Adaptive candidate prediction

* Update implementation to match hf transformers

* Tuning

* Fix bug where last token was not used for ngram prediction

* Remove heuristic for num_pred_tokens (no benefit)

* fix: n_candidates bug.

* Add draft_model_num_pred_tokens server setting

* Cleanup

* Update README
2024-01-31 14:08:14 -05:00
Andrei Betlen
71e3e4c435 Update llama.cpp 2024-01-31 10:41:42 -05:00