Andrei Betlen
b1637c2319
Bump version
2024-02-13 12:35:04 -05:00
Andrew Lapp
d6be5333e1
fix: sample idx off-by-one error for logit_processors ( #1179 )
...
* fix sample_idx off-by-one error
* self._scores is indexed differently, only modify the index within self._input_ids
---------
Co-authored-by: Andrew Lapp <andrew@rew.la>
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-13 12:26:07 -05:00
Andrei Betlen
f7cdf78788
Update llama.cpp
2024-02-13 12:24:00 -05:00
Andrei Betlen
68fb71b6a2
fix: missing generation_prompt in chatml-function-calling
2024-02-13 03:24:41 -05:00
Andrei Betlen
4b0e3320bd
fix: minor formatting bugs for chatml-function-calling
2024-02-13 03:11:35 -05:00
Andrei Betlen
6fe8b427e1
Bump version
2024-02-13 02:46:52 -05:00
Andrei Betlen
d1822fed6b
fix: Don't change order of json schema object properties unless prop_order is passed, Closes #1180
2024-02-13 02:44:00 -05:00
Andrei Betlen
5efc45bdfd
Update llama.cpp
2024-02-13 02:43:07 -05:00
Andrei Betlen
4348a6cdf0
docs: Fix typo
2024-02-13 02:04:54 -05:00
Andrei Betlen
d605875772
Bump version
2024-02-12 16:28:30 -05:00
Andrei Betlen
b82b0e1014
docs: Temporarily revert function calling docs
2024-02-12 16:27:43 -05:00
Andrei Betlen
cb791716b4
fix: Always set logits_all = True when using speculative decoding
2024-02-12 16:19:05 -05:00
Andrei
153a0049d9
feat: Generic chatml Function Calling ( #957 )
...
* Add demo notebook
* Add initial chat handler
* Update OpenAI types
* Add generic chatml function calling (wip)
* Update chatml generic function calling.
* Progress on auto-tool calls
* fix streaming functions
* Remove print statements
* fix: Suppress output from llama.cpp init and grammar creation
* Add OpenAI v1 python api compatible chat completion function
* Support non-streaming multi-tool calls
* Format
* Include function_call in response.
2024-02-12 15:56:07 -05:00
Andrei Betlen
69413ce08e
Update llama.cpp
2024-02-11 19:00:17 -05:00
Andrei Betlen
9368670639
Update llama.cpp
2024-02-11 14:02:46 -05:00
Connor
a05d90446f
fix: Circular dependency preventing early Llama object free ( #1176 )
...
commit 901827013b introduced a cyclic dependency
within Llama objects. That change causes old models to linger in memory longer
than necessary, thereby creating memory bloat in most applications attempting
to switch between models at runtime. This patch simply removes the problematic
line, allowing models to deallocate without relying on GC. One might also
consider combining `weakref.ref` with a `@property` if the `llama` attribute is
absolutely necessary to expose in the tokenizer class.
2024-02-11 13:57:57 -05:00
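The `weakref.ref` plus `@property` alternative mentioned in the body of a05d90446f could look roughly like the sketch below. `Model` and `Tokenizer` are hypothetical stand-ins, not the library's actual classes, and the immediate deallocation shown relies on CPython's reference counting.

```python
import gc
import weakref


class Model:
    """Hypothetical stand-in for the Llama object."""

    def __init__(self):
        # The model owns its tokenizer through a normal (strong) reference.
        self.tokenizer = Tokenizer(self)


class Tokenizer:
    """Hypothetical stand-in for the tokenizer class."""

    def __init__(self, model):
        # Keep only a weak back-reference, so no reference cycle is formed.
        self._model_ref = weakref.ref(model)

    @property
    def model(self):
        # Returns the owning model, or None once it has been freed.
        return self._model_ref()


def freed_without_gc():
    """Return True if the model is released without the cycle collector."""
    gc.disable()
    try:
        m = Model()
        probe = weakref.ref(m)
        tok = m.tokenizer
        assert tok.model is m
        del m  # last strong reference dropped; refcounting frees the model
        return probe() is None and tok.model is None
    finally:
        gc.enable()
```

Callers of the property have to handle the `None` case, which is the price of letting the model deallocate as soon as its last strong reference goes away.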
Akarshan Biswas
918ff27e50
docs: Set the correct command for compiling with SYCL support ( #1172 )
2024-02-11 13:55:15 -05:00
Douglas Hanley
19b55ad3e5
feat: use gpu backend for clip if available ( #1175 )
2024-02-11 13:53:59 -05:00
Andrei Betlen
63b0c37836
Update llama.cpp
2024-02-09 13:36:58 -05:00
Andrei Betlen
4abb8c9386
Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2024-02-09 13:32:31 -05:00
Andrei Betlen
e16f06e6eb
fix: revert _create_completions.
2024-02-09 02:02:13 -05:00
Andrei Betlen
dfc1b17341
Update llama.cpp
2024-02-08 23:38:12 -05:00
Andrei Betlen
5b4ad6c80b
Merge branch 'main' of https://github.com/abetlen/llama-cpp-python into main
2024-02-08 23:34:45 -05:00
Andrei Betlen
85d3374b4d
fix: broken import
2024-02-08 01:13:28 -05:00
Andrei Betlen
b5fca911b5
feat: Move tokenizer to own module
2024-02-08 01:08:18 -05:00
Andrei Betlen
2ef7ba3aed
misc: rename grammar test
2024-02-08 01:07:44 -05:00
Jeffrey Fong
901827013b
feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class ( #1078 )
...
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + integrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* set up parallel function calling
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + integrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* resolve merge conflict
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* Cleanup PR, fix breaking changes
* Use hf_pretrained_model_name_or_path for tokenizer
* fix hf tokenizer in streaming
* update README
* refactor offset mapping
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-07 20:07:03 -05:00
Andrei Betlen
ce12775490
Update llama.cpp
2024-02-06 18:50:56 -05:00
Andrei Betlen
34f31040f6
Bump version
2024-02-06 12:47:59 -05:00
Andrei Betlen
5e3e67af47
Update llama.cpp
2024-02-06 12:44:07 -05:00
Andrei Betlen
310fbf4e49
Update llama.cpp
2024-02-05 22:07:14 -05:00
Andrei Betlen
59760c85ed
fix: Use llama_log_callback to avoid suppress_stdout_stderr
2024-02-05 21:52:12 -05:00
Andrei Betlen
3553b14670
Update llama.cpp
2024-02-05 13:26:50 -05:00
Andrei
7467f129e5
Revert "Fix: fileno error google colab ( #729 ) ( #1156 )" ( #1157 )
...
This reverts commit bebfba0f08.
2024-02-02 12:18:55 -05:00
Dulsara
bebfba0f08
Fix: fileno error google colab ( #729 ) ( #1156 )
...
Instead of using devnull, use a dummy class whose `write()` method does nothing.
2024-02-02 12:05:46 -05:00
Andrei Betlen
8a5911bd5d
Update llama.cpp
2024-02-02 09:41:27 -05:00
Andrei Betlen
de526d0214
Update llama.cpp
2024-02-01 12:35:31 -05:00
Andrei Betlen
3322eadbf3
Bump version
2024-01-31 15:10:18 -05:00
Andrei Betlen
a8cb34eacd
Update llama.cpp
2024-01-31 15:05:51 -05:00
Andrei
fb762a6041
Add speculative decoding ( #1120 )
...
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use samplingcontext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
2024-01-31 14:08:14 -05:00
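For readers unfamiliar with the "prompt lookup decoding" draft model named in fb762a6041, the general idea is to find an earlier occurrence of the most recent n-gram in the token sequence and propose the tokens that followed it as draft candidates. The sketch below illustrates that idea only; it is not the code from this commit, and the function name and `max_ngram_size` parameter are assumptions made for illustration (`num_pred_tokens` echoes the server setting named in the commit body).

```python
def prompt_lookup_draft(input_ids, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram earlier in input_ids.

    Hypothetical helper: tries the longest n-gram first and returns the tokens
    that followed its most recent earlier occurrence, or [] if none is found.
    """
    n = len(input_ids)
    for size in range(min(max_ngram_size, n - 1), 0, -1):
        tail = input_ids[n - size:]
        # Scan earlier start positions, most recent first, skipping the tail itself.
        for start in range(n - size - 1, -1, -1):
            if input_ids[start:start + size] == tail:
                follow = input_ids[start + size:start + size + num_pred_tokens]
                if follow:
                    return follow
    return []
```

The target model then verifies the proposed tokens in a single forward pass and keeps only the prefix it agrees with, so a poor draft costs speed but should not change the output.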
Andrei Betlen
71e3e4c435
Update llama.cpp
2024-01-31 10:41:42 -05:00
Andrei Betlen
2b37d8e438
fix: Run server command. Closes #1143
2024-01-31 10:37:19 -05:00
Andrei Betlen
078cca0361
fix: Pass raise_exception and add_generation_prompt to jinja2 chat template
2024-01-31 08:42:21 -05:00
Andrei Betlen
411494706a
Update llama.cpp
2024-01-31 08:35:21 -05:00
Andrei Betlen
bf9e824922
Bump version
2024-01-30 12:27:27 -05:00
Andrei Betlen
247a16de66
docs: Update README
2024-01-30 12:23:07 -05:00
Andrei Betlen
13b7ced7da
Update llama.cpp
2024-01-30 12:21:41 -05:00
Andrei Betlen
011cd84ded
Update llama.cpp
2024-01-30 09:48:09 -05:00
Andrei
da003d8768
Automatically set chat format from gguf ( #1110 )
...
* Use jinja formatter to load chat format from gguf
* Fix off-by-one error in metadata loader
* Implement chat format auto-detection
2024-01-29 14:22:23 -05:00
Andrei Betlen
059f6b3ac8
docs: fix typos
2024-01-29 11:02:25 -05:00