Andrei Betlen
d99a6ba607
fix: segfault for models without eos / bos tokens. Closes #1463
2024-05-16 00:37:27 -04:00
twaka
5212fb08ae
feat: add MinTokensLogitProcessor and min_tokens argument to server ( #1333 )
...
* implement min_tokens
* set default to 0
* pass min_tokens
* fix
* remove copy
* implement MinTokensLogitsProcessor
* format
* fix condition
2024-05-14 09:50:53 -04:00
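Note on the `min_tokens` commit above: a minimal sketch of the idea, assuming a logits processor is a callable that receives the token ids so far and the next-token scores. The class name `MinTokensProcessorSketch`, its arguments, and the use of plain Python lists are illustrative assumptions, not the library's actual `MinTokensLogitsProcessor` signature.

```python
import math
from typing import List, Optional


class MinTokensProcessorSketch:
    """Hypothetical sketch: block the EOS token until min_tokens new tokens exist."""

    def __init__(self, min_tokens: int, eos_token_id: int) -> None:
        self.min_tokens = min_tokens
        self.eos_token_id = eos_token_id
        self.prompt_len: Optional[int] = None  # set on the first call

    def __call__(self, input_ids: List[int], scores: List[float]) -> List[float]:
        # The first call sees only the prompt, so remember its length.
        if self.prompt_len is None:
            self.prompt_len = len(input_ids)
        generated = len(input_ids) - self.prompt_len
        if generated < self.min_tokens:
            # Copy before modifying so the caller's list is left untouched.
            scores = list(scores)
            scores[self.eos_token_id] = -math.inf
        return scores


proc = MinTokensProcessorSketch(min_tokens=2, eos_token_id=0)
first = proc([5, 6], [1.0, 2.0])         # 0 tokens generated: EOS masked
second = proc([5, 6, 7], [1.0, 2.0])     # 1 token generated: EOS masked
third = proc([5, 6, 7, 8], [1.0, 2.0])   # 2 tokens generated: scores unchanged
```

Setting the EOS score to negative infinity makes it unselectable by any sampler that works from these scores, which is why a default of 0 leaves generation unaffected.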
Sigbjørn Skjæret
389e09c2f5
misc: Remove unnecessary metadata lookups ( #1448 )
...
Special tokens are already mapped from metadata by llama.cpp
2024-05-14 09:44:09 -04:00
Sigbjørn Skjæret
5ab40e6167
feat: Support multiple chat templates - step 1 ( #1396 )
...
* Support multiple chat templates - step 1
As a first step, allow the user to select a template from metadata with the chat_format parameter in the form of `chat_template.name`.
* register chat templates to self.chat_formats instead of globally
* Don't expose internal chat handlers yet
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-05-09 09:49:09 -04:00
Sigbjørn Skjæret
4a7122d22f
feat: fill-in-middle support ( #1386 )
...
* Proper fill-in-middle support
Use prefix/middle/suffix tokens when metadata is present in GGUF, for example in [this model](https://huggingface.co/CISCai/CodeQwen1.5-7B-Chat-SOTA-GGUF).
* fall back to internal prefix/middle/suffix id
In some cases llama.cpp will make a guess at fim tokens, use them if there's no metadata.
* typo--
* don't insert special tokens that are not there in suffix
Note: add_bos is misnamed, it's actually add_special and can cause several special tokens to be added to the token list (the special parameter is actually parse_special).
* don't add/parse any special tokens when using fim
I've left original behavior when no fim tokens are found, but this should perhaps be re-evaluated.
* don't append suffix to prompt_tokens unless fim tokens are detected
* make sure we only do this for fim
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-05-08 02:26:22 -04:00
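Note on the fill-in-middle commit above: a rough sketch of how a FIM prompt could be assembled from token ids, assuming the common prefix/suffix/middle ordering. The function `build_fim_tokens`, its parameters, and the token ids used are hypothetical; the actual implementation may handle each special token separately.

```python
from typing import List, Optional


def build_fim_tokens(
    prefix_ids: List[int],
    suffix_ids: List[int],
    fim_prefix: Optional[int],
    fim_suffix: Optional[int],
    fim_middle: Optional[int],
) -> List[int]:
    """Hypothetical helper: wrap prompt and suffix in FIM special tokens."""
    # Without a full set of FIM tokens, return the prompt alone; the suffix
    # is not appended, matching the behaviour described in the commit body.
    if fim_prefix is None or fim_suffix is None or fim_middle is None:
        return list(prefix_ids)
    # Assumed ordering: <prefix> prompt <suffix> suffix <middle>
    return [fim_prefix, *prefix_ids, fim_suffix, *suffix_ids, fim_middle]


with_fim = build_fim_tokens([1, 2], [3], fim_prefix=10, fim_suffix=11, fim_middle=12)
without_fim = build_fim_tokens([1, 2], [3], None, None, None)
```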
Bruno Alvisio
a50d24e3a7
fix: chat_format log where auto-detected format prints None ( #1434 )
2024-05-08 02:19:35 -04:00
Andrei Betlen
9f7a85571a
fix: Use memmove to copy str_value kv_override. Closes #1417
2024-05-03 19:07:50 -04:00
Andrei Betlen
29b6e9a5c8
fix: wrong parameter for flash attention in pickle __getstate__
2024-04-30 09:32:47 -04:00
Andrei Betlen
22d77eefd2
feat: Add option to enable flash_attn to Llama params and ModelSettings
2024-04-30 09:29:16 -04:00
Andrei Betlen
a411612b38
feat: Add support for str type kv_overrides
2024-04-27 23:42:19 -04:00
Douglas Hanley
f6ed21f9a2
feat: Allow for possibly non-pooled embeddings ( #1380 )
...
* allow for possibly non-pooled embeddings
* add more to embeddings section in README.md
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-04-25 21:32:44 -04:00
Andrei Betlen
d40a250ef3
feat: Use new llama_token_is_eog in create_completions
2024-04-22 00:35:47 -04:00
Andrei Betlen
cc81afebf0
feat: Add stopping_criteria to ChatFormatter, allow stopping on arbitrary token ids, fixes llama3 instruct
2024-04-20 00:00:53 -04:00
tc-wolf
4924455dec
feat: Make saved state more compact on-disk ( #1296 )
...
* State load/save changes
- Only store up to `n_tokens` logits instead of full `(n_ctx, n_vocab)` sized array.
- Difference between ~350MB and ~1500MB for example prompt with ~300 tokens (makes sense lol)
- Auto-formatting changes
* Back out formatting changes
2024-04-17 10:06:50 -04:00
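Note on the saved-state commit above: a back-of-the-envelope sketch of why storing only `n_tokens` rows of logits shrinks the file. The values of `n_ctx` and `n_vocab` and the 4-byte float width are assumptions chosen for illustration, not figures from the commit. The sketch counts the logits array only, so it will not reproduce the ~350MB and ~1500MB totals quoted there.

```python
def logits_bytes(rows: int, n_vocab: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of a (rows, n_vocab) logits array."""
    return rows * n_vocab * bytes_per_value


# Hypothetical model dimensions, for illustration only.
n_ctx, n_vocab, n_tokens = 4096, 32000, 300

full = logits_bytes(n_ctx, n_vocab)        # full (n_ctx, n_vocab) array
compact = logits_bytes(n_tokens, n_vocab)  # only the rows actually used

print(f"full:    {full / 1e6:.1f} MB")
print(f"compact: {compact / 1e6:.1f} MB")
```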
ddh0
c96b2daebf
feat: Use all available CPUs for batch processing ( #1345 )
2024-04-17 10:05:54 -04:00
Andrei Betlen
bb65b4d764
fix: pass correct type to chat handlers for chat completion logprobs
2024-04-10 03:41:55 -04:00
Andrei Betlen
8649d7671b
fix: segfault when logits_all=False. Closes #1319
2024-04-03 15:30:31 -04:00
Limour
f165048a69
feat: add support for KV cache quantization options ( #1307 )
...
* add KV cache quantization options
https://github.com/abetlen/llama-cpp-python/discussions/1220
https://github.com/abetlen/llama-cpp-python/issues/1305
* Add ggml_type
* Use ggml_type instead of string for quantization
* Add server support
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-04-01 10:19:28 -04:00
windspirit95
aa9f1ae011
feat: Add logprobs support to chat completions ( #1311 )
...
* Add logprobs return in ChatCompletionResponse
* Fix duplicate field
* Set default to false
* Simplify check
* Add server example
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-03-31 13:30:13 -04:00
Andrei Betlen
4084aabe86
fix: set default pooling type to unspecified
2024-03-14 10:04:57 -04:00
Andrei Betlen
d318cc8b83
fix: Set default pooling_type to mean, check for null pointer.
2024-03-14 09:17:41 -04:00
Douglas Hanley
2811014bae
feat: Switch embed to llama_get_embeddings_seq ( #1263 )
...
* switch to llama_get_embeddings_seq
* Remove duplicate definition of llama_get_embeddings_seq
Co-authored-by: Andrei <abetlen@gmail.com>
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-03-08 20:59:35 -05:00
Andrei Betlen
93dc56ace8
Update llama.cpp
2024-03-06 01:32:00 -05:00
Andrei Betlen
97aa3a153d
docs: Add information re: auto chat formats. Closes #1236
2024-03-01 13:10:25 -05:00
Andrei Betlen
f062a7f51d
feat: Update llama.cpp
2024-03-01 12:57:16 -05:00
Sigbjørn Skjæret
c36ab15e68
fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter ( #1230 )
...
The token strings were not correctly retrieved (empty).
2024-02-28 01:30:31 -05:00
Andrei Betlen
2292af5796
feat: Update llama.cpp
2024-02-25 16:53:58 -05:00
Andrei Betlen
47bad30dd7
fix: LlamaHFTokenizer now receives pre_tokens
2024-02-23 12:23:24 -05:00
Andrei Betlen
db776a885c
fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8'
2024-02-23 11:24:53 -05:00
Andrei Betlen
e6d6260a91
fix: Update from_pretrained defaults to match hf_hub_download
2024-02-22 00:10:23 -05:00
Andrei
7f51b6071f
feat(low-level-api): Improve API static type-safety and performance ( #1205 )
2024-02-21 16:25:38 -05:00
Andrei
0f8aa4ab5c
feat: Pull models directly from huggingface ( #1206 )
...
* Add from_pretrained method to Llama class
* Update docs
* Merge filename and pattern
2024-02-21 16:25:10 -05:00
Andrei Betlen
53f6f5f415
fix: self.numa missing
2024-02-17 01:02:33 -05:00
Andrei Betlen
fdce078cb9
feat: Update llama.cpp
2024-02-17 00:37:51 -05:00
Andrei Betlen
0ce66bc080
fix: create_embedding broken response for input type str
2024-02-15 16:09:48 -05:00
Douglas Hanley
7bb91f025f
fix: Incorporate embedding pooling layer fixes ( #1194 )
...
* remove division by token count
* truncate to n_batch, not n_ctx
2024-02-15 15:16:30 -05:00
Douglas Hanley
d7a67917ba
feat: Support batch embeddings ( #1186 )
...
* handle batched embeddings
* fix normalization issue
* fix type hints, ensure no breaking changes to embed
* Clear kv cache / reset internal state after embedding complete
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-14 04:26:09 -05:00
Andrew Lapp
d6be5333e1
fix: sample idx off-by-one error for logit_processors ( #1179 )
...
* fix sample_idx off-by-one error
* self._scores is indexed differently, only modify the index within self._input_ids
---------
Co-authored-by: Andrew Lapp <andrew@rew.la>
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-13 12:26:07 -05:00
Andrei Betlen
cb791716b4
fix: Always set logits_all = True when using speculative decoding
2024-02-12 16:19:05 -05:00
Andrei
153a0049d9
feat: Generic chatml Function Calling ( #957 )
...
* Add demo notebook
* Add initial chat handler
* Update OpenAI types
* Add generic chatml function calling (wip)
* Update chatml generic function calling.
* Progress on auto-tool calls
* fix streaming functions
* Remove print statements
* fix: Suppress output from llama.cpp init and grammar creation
* Add OpenAI v1 python api compatible chat completion function
* Support non-streaming multi-tool calls
* Format
* Include function_call in response.
2024-02-12 15:56:07 -05:00
Andrei Betlen
4abb8c9386
Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2024-02-09 13:32:31 -05:00
Andrei Betlen
e16f06e6eb
fix: revert _create_completions.
2024-02-09 02:02:13 -05:00
Andrei Betlen
b5fca911b5
feat: Move tokenizer to own module
2024-02-08 01:08:18 -05:00
Jeffrey Fong
901827013b
feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class ( #1078 )
...
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + integrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* set up parallel function calling
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + integrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* resolve merge conflict
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* Cleanup PR, fix breaking changes
* Use hf_pretrained_model_name_or_path for tokenizer
* fix hf tokenizer in streaming
* update README
* refactor offset mapping
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-07 20:07:03 -05:00
Andrei Betlen
59760c85ed
fix: Use llama_log_callback to avoid suppress_stdout_stderr
2024-02-05 21:52:12 -05:00
Andrei
fb762a6041
Add speculative decoding ( #1120 )
...
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use samplingcontext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
2024-01-31 14:08:14 -05:00
Andrei
da003d8768
Automatically set chat format from gguf ( #1110 )
...
* Use jinja formatter to load chat format from gguf
* Fix off-by-one error in metadata loader
* Implement chat format auto-detection
2024-01-29 14:22:23 -05:00
Andrei Betlen
9677a1f2c8
fix: Check order
2024-01-23 22:28:03 -05:00
Andrei Betlen
4d6b2f7b91
fix: format
2024-01-23 22:08:27 -05:00
Phil H
fe5d6ea648
fix: GGUF metadata KV overrides, re #1011 ( #1116 )
...
* kv overrides another attempt
* add sentinel element, simplify array population
* ensure sentinel element is zeroed
2024-01-23 22:00:38 -05:00