Douglas Hanley
d7a67917ba
feat: Support batch embeddings ( #1186 )
...
* handle batched embeddings
* fix normalization issue
* fix type hints, ensure no breaking changes to embed
* Clear kv cache / reset internal state after embedding complete
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-14 04:26:09 -05:00
Andrew Lapp
d6be5333e1
fix: sample idx off-by-one error for logit_processors ( #1179 )
...
* fix sample_idx off-by-one error
* self._scores is indexed differently, only modify the index within self._input_ids
---------
Co-authored-by: Andrew Lapp <andrew@rew.la>
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-13 12:26:07 -05:00
Andrei Betlen
cb791716b4
fix: Always set logits_all = True when using speculative decoding
2024-02-12 16:19:05 -05:00
Andrei
153a0049d9
feat: Generic chatml Function Calling ( #957 )
...
* Add demo notebook
* Add initial chat handler
* Update OpenAI types
* Add generic chatml function calling (wip)
* Update chatml generic function calling.
* Progress on auto-tool calls
* fix streaming functions
* Remove print statements
* fix: Suppress output from llama.cpp init and grammar creation
* Add OpenAI v1 python api compatible chat completion function
* Support non-streaming multi-tool calls
* Format
* Include function_call in response.
2024-02-12 15:56:07 -05:00
Andrei Betlen
4abb8c9386
Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2024-02-09 13:32:31 -05:00
Andrei Betlen
e16f06e6eb
fix: revert _create_completions.
2024-02-09 02:02:13 -05:00
Andrei Betlen
b5fca911b5
feat: Move tokenizer to own module
2024-02-08 01:08:18 -05:00
Jeffrey Fong
901827013b
feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class ( #1078 )
...
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + inteegrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* set up parallel function calling
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + inteegrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* resolve merge conflict
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* Cleanup PR, fix breaking changes
* Use hf_pretrained_model_name_or_path for tokenizer
* fix hf tokenizer in streaming
* update README
* refactor offset mapping
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-07 20:07:03 -05:00
Andrei Betlen
59760c85ed
fix: Use llama_log_callback to avoid suppress_stdout_stderr
2024-02-05 21:52:12 -05:00
Andrei
fb762a6041
Add speculative decoding ( #1120 )
...
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use samplingcontext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
2024-01-31 14:08:14 -05:00
Andrei
da003d8768
Automatically set chat format from gguf ( #1110 )
...
* Use jinja formatter to load chat format from gguf
* Fix off-by-one error in metadata loader
* Implement chat format auto-detection
2024-01-29 14:22:23 -05:00
Andrei Betlen
9677a1f2c8
fix: Check order
2024-01-23 22:28:03 -05:00
Andrei Betlen
4d6b2f7b91
fix: format
2024-01-23 22:08:27 -05:00
Phil H
fe5d6ea648
fix: GGUF metadata KV overrides, re #1011 ( #1116 )
...
* kv overrides another attempt
* add sentinel element, simplify array population
* ensure sentinel element is zeroed
2024-01-23 22:00:38 -05:00
Andrei Betlen
5a34c57e54
feat: Expose gguf model metadata in metadata property
2024-01-19 10:46:03 -05:00
Andrei Betlen
3babe3512c
Fix mirostat sampling
2024-01-19 08:31:59 -05:00
Andrei Betlen
48c3b77e6f
Offload KQV by default
2024-01-18 11:08:57 -05:00
Andrei Betlen
7b46bb5a78
Re-order classes in llama.py
2024-01-17 09:16:13 -05:00
Andrei Betlen
cc4630e66f
Move helper classes to _internals submodule
2024-01-17 09:14:00 -05:00
Andrei Betlen
3b92419132
Move cache classes to llama_cache submodule.
2024-01-17 09:09:12 -05:00
Andrei Betlen
84615adbc6
Add split_mode option. Closes #1085
2024-01-15 12:49:20 -05:00
Phil H
76aafa6149
Implement GGUF metadata KV overrides ( #1011 )
...
* Implement GGUF metadata overrides
* whitespace fix
* Fix kv overrides.
* Fix pointer and pickle
* Match llama.cpp kv_overrides cli argument
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-01-15 12:29:29 -05:00
Stephen Hankinson
df3be58d6c
Add ability to pass in penalize_nl param ( #1068 )
2024-01-10 02:46:27 -05:00
Andrei Betlen
d9a1d90fd7
Fix typo
2023-12-22 15:12:27 -05:00
swg
4b01a873ef
server: Support none defaulting to infinity for completions ( #111 )
...
* Support defaulting to infinity or -1 for chat completions
* Check if completion_tokens is none in error handler.
* fix: max_tokens in create completion should match openai spec
* Fix __call__
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2023-12-22 14:05:13 -05:00
twaka
2f03fb0231
fix text_offset of multi-token characters ( #1037 )
...
* fix text_offsets for bytes tokens
* fix
2023-12-22 00:03:29 -05:00
Andrei Betlen
a05b4da80a
fix: float32 is not JSON serializable when streaming logits.
2023-12-18 18:40:36 -05:00
Andrei Betlen
095c650006
Add offload_kqv option to llama and server
2023-12-18 15:36:09 -05:00
Andrei Betlen
472b344ae3
Remove unnused import
2023-12-18 15:32:40 -05:00
kddubey
6b2e0e05b4
perf: Don't convert logprobs arrays to lists ( #1021 )
2023-12-18 14:28:12 -05:00
Brandon Roberts
62944df142
Bugfix: Remove f16_kv, add offload_kqv field ( #1019 )
...
F16_KV appears to have been removed here: af99c6fbfc
This addresses two issues:
- #995 which just requests to add the KV cache offloading param
- #1006 a NULL ptr exception when using the embeddings (introduced by
leaving f16_kv in the fields struct)
2023-12-18 14:27:11 -05:00
Daniele Morotti
f1c631dc53
Bug fixed with n_ctx=0 ( #1015 )
...
If the n_ctx is set to 0 the code should use the maximum context length of the selected model, but it didn't work. There was a problem with the initialization of this parameter and a related problem with 'n_batch'.
2023-12-16 18:59:50 -05:00
kddubey
5a8944672f
Fix logits_to_logprobs for 2-D and 3-D logits ( #1002 )
...
* Fix logits_to_logprobs for 2-D and 3-D logits
* Set dtype to single
* Test size
2023-12-16 18:59:26 -05:00
Tanner Hobson
ef22e478db
Replace logits_to_logprobs implementation with numpy equivalent to llama.cpp ( #991 )
...
See #990 . This change makes the logits_to_logprobs function equivalent to the version in the llama.cpp repository. It uses numpy so it's much faster than the previous version.
2023-12-11 20:46:27 -05:00
Andrei Betlen
ec26f364cc
Remove f16_kv
2023-12-11 10:25:37 -05:00
kddubey
b069d06346
Fix #891 ( #952 )
2023-11-29 05:39:52 -05:00
Andrei Betlen
6308f21d5e
docs: Update Llama docs
2023-11-26 15:56:40 -05:00
Andrei Betlen
4026166e68
docs: Update completion and chat_completion parameter docstrings
2023-11-24 03:24:19 -05:00
Andrei Betlen
b6bb7ac76a
docs: Add Llama class example
2023-11-22 23:10:04 -05:00
Andrei Betlen
7a3f87846b
Format
2023-11-21 04:02:20 -05:00
Andrei Betlen
422ebc89ce
Fix: Add logit_bias to all completion api methods
2023-11-21 04:01:36 -05:00
Andrei Betlen
07e47f55ba
Add support for logit_bias outside of server api. Closes #827
2023-11-21 03:59:46 -05:00
TK-Master
b8438f70b5
Added support for min_p ( #921 )
...
* Added support for min_p
My small contribution to this great project.
Ref: https://github.com/ggerganov/llama.cpp/pull/3841
Closes: https://github.com/abetlen/llama-cpp-python/issues/911
* Fix for negative temp (sample_softmax)
2023-11-20 23:21:33 -05:00
Andrei Betlen
a34d480141
Fix #929
2023-11-20 22:50:59 -05:00
Andrei Betlen
6f0b0b1b84
Fix sampling bug when logits_all=False
2023-11-10 05:15:41 -05:00
Andrei Betlen
d9b38e3e3a
Potential bugfix for eval
2023-11-10 04:41:19 -05:00
Andrei Betlen
e7962d2c73
Fix: default max_tokens matches openai api (16 for completion, max length for chat completion)
2023-11-10 02:49:27 -05:00
Andrei Betlen
fd41ed3a90
Add set_seed to Llama class
2023-11-08 11:09:41 -05:00
Andrei Betlen
ca4cb88351
Fix destructor NoneType is not callable error
2023-11-08 11:05:45 -05:00
Andrei Betlen
b30b9c338b
Add JSON mode support. Closes #881
2023-11-08 00:07:16 -05:00