baalajimaestro/llama.cpp

Author	SHA1	Message	Date
Andrei Betlen	d99a6ba607	fix: segfault for models without eos / bos tokens. Closes #1463	2024-05-16 00:37:27 -04:00
twaka	5212fb08ae	feat: add MinTokensLogitProcessor and min_tokens argument to server (#1333 ) * implement min_tokens * set default to 0 * pass min_tokens * fix * remove copy * implement MinTokensLogitsProcessor * format * fix condition	2024-05-14 09:50:53 -04:00
Sigbjørn Skjæret	389e09c2f5	misc: Remove unnecessary metadata lookups (#1448 ) Special tokens are already mapped from metadata by llama.cpp	2024-05-14 09:44:09 -04:00
Sigbjørn Skjæret	5ab40e6167	feat: Support multiple chat templates - step 1 (#1396 ) * Support multiple chat templates - step 1 As a first step, allow user to select template from metadata with chat_format parameter in the form of `chat_template.name`. * register chat templates to self.chat_formats instead of globally * Don't expose internal chat handlers yet --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-05-09 09:49:09 -04:00
Sigbjørn Skjæret	4a7122d22f	feat: fill-in-middle support (#1386 ) * Proper fill-in-middle support Use prefix/middle/suffix tokens when metadata is present in GGUF, like f.ex. in [this](https://huggingface.co/CISCai/CodeQwen1.5-7B-Chat-SOTA-GGUF) one. * fall back to internal prefix/middle/suffix id In some cases llama.cpp will make a guess at fim tokens, use them if there's no metadata. * typo-- * don't insert special tokens that are not there in suffix Note: add_bos is misnamed, it's actually add_special and can cause several special tokens to be added to the token list (the special parameter is actually parse_special). * don't add/parse any special tokens when using fim I've left original behavior when no fim tokens are found, but this should perhaps be re-evaluated. * don't append suffix to prompt_tokens unless fim tokens are detected * make sure we only do this for fim --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-05-08 02:26:22 -04:00
Bruno Alvisio	a50d24e3a7	fix: chat_format log where auto-detected format prints `None` (#1434 )	2024-05-08 02:19:35 -04:00
Andrei Betlen	9f7a85571a	fix: Use memmove to copy str_value kv_override. Closes #1417	2024-05-03 19:07:50 -04:00
Andrei Betlen	29b6e9a5c8	fix: wrong parameter for flash attention in pickle __getstate__	2024-04-30 09:32:47 -04:00
Andrei Betlen	22d77eefd2	feat: Add option to enable `flash_attn` to Llama params and ModelSettings	2024-04-30 09:29:16 -04:00
Andrei Betlen	a411612b38	feat: Add support for str type kv_overrides	2024-04-27 23:42:19 -04:00
Douglas Hanley	f6ed21f9a2	feat: Allow for possibly non-pooled embeddings (#1380 ) * allow for possibly non-pooled embeddings * add more to embeddings section in README.md --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-04-25 21:32:44 -04:00
Andrei Betlen	d40a250ef3	feat: Use new llama_token_is_eog in create_completions	2024-04-22 00:35:47 -04:00
Andrei Betlen	cc81afebf0	feat: Add stopping_criteria to ChatFormatter, allow stopping on arbitrary token ids, fixes llama3 instruct	2024-04-20 00:00:53 -04:00
tc-wolf	4924455dec	feat: Make saved state more compact on-disk (#1296 ) * State load/save changes - Only store up to `n_tokens` logits instead of full `(n_ctx, n_vocab)` sized array. - Difference between ~350MB and ~1500MB for example prompt with ~300 tokens (makes sense lol) - Auto-formatting changes * Back out formatting changes	2024-04-17 10:06:50 -04:00
ddh0	c96b2daebf	feat: Use all available CPUs for batch processing (#1345 )	2024-04-17 10:05:54 -04:00
Andrei Betlen	bb65b4d764	fix: pass correct type to chat handlers for chat completion logprobs	2024-04-10 03:41:55 -04:00
Andrei Betlen	8649d7671b	fix: segfault when logits_all=False. Closes #1319	2024-04-03 15:30:31 -04:00
Limour	f165048a69	feat: add support for KV cache quantization options (#1307 ) * add KV cache quantization options https://github.com/abetlen/llama-cpp-python/discussions/1220 https://github.com/abetlen/llama-cpp-python/issues/1305 * Add ggml_type * Use ggml_type instead of string for quantization * Add server support --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-04-01 10:19:28 -04:00
windspirit95	aa9f1ae011	feat: Add logprobs support to chat completions (#1311 ) * Add logprobs return in ChatCompletionResponse * Fix duplicate field * Set default to false * Simplify check * Add server example --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-03-31 13:30:13 -04:00
Andrei Betlen	4084aabe86	fix: set default pooling type to unspecified	2024-03-14 10:04:57 -04:00
Andrei Betlen	d318cc8b83	fix: Set default pooling_type to mean, check for null pointer.	2024-03-14 09:17:41 -04:00
Douglas Hanley	2811014bae	feat: Switch embed to llama_get_embeddings_seq (#1263 ) * switch to llama_get_embeddings_seq * Remove duplicate definition of llama_get_embeddings_seq Co-authored-by: Andrei <abetlen@gmail.com> --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-03-08 20:59:35 -05:00
Andrei Betlen	93dc56ace8	Update llama.cpp	2024-03-06 01:32:00 -05:00
Andrei Betlen	97aa3a153d	docs: Add information re: auto chat formats. Closes #1236	2024-03-01 13:10:25 -05:00
Andrei Betlen	f062a7f51d	feat: Update llama.cpp	2024-03-01 12:57:16 -05:00
Sigbjørn Skjæret	c36ab15e68	fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter (#1230 ) The token strings were not correctly retrieved (empty).	2024-02-28 01:30:31 -05:00
Andrei Betlen	2292af5796	feat: Update llama.cpp	2024-02-25 16:53:58 -05:00
Andrei Betlen	47bad30dd7	fix: LlamaHFTokenizer now receives pre_tokens	2024-02-23 12:23:24 -05:00
Andrei Betlen	db776a885c	fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8'	2024-02-23 11:24:53 -05:00
Andrei Betlen	e6d6260a91	fix: Update from_pretrained defaults to match hf_hub_download	2024-02-22 00:10:23 -05:00
Andrei	7f51b6071f	feat(low-level-api): Improve API static type-safety and performance (#1205 )	2024-02-21 16:25:38 -05:00
Andrei	0f8aa4ab5c	feat: Pull models directly from huggingface (#1206 ) * Add from_pretrained method to Llama class * Update docs * Merge filename and pattern	2024-02-21 16:25:10 -05:00
Andrei Betlen	53f6f5f415	fix: self.numa missing	2024-02-17 01:02:33 -05:00
Andrei Betlen	fdce078cb9	feat: Update llama.cpp	2024-02-17 00:37:51 -05:00
Andrei Betlen	0ce66bc080	fix: create_embedding broken response for input type str	2024-02-15 16:09:48 -05:00
Douglas Hanley	7bb91f025f	fix: Incorporate embedding pooling layer fixes (#1194 ) * remove division by token count * truncate to n_batch, not n_ctx	2024-02-15 15:16:30 -05:00
Douglas Hanley	d7a67917ba	feat: Support batch embeddings (#1186 ) * handle batched embeddings * fix normalization issue * fix type hints, ensure no breaking changes to embed * Clear kv cache / reset internal state after embedding complete --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-14 04:26:09 -05:00
Andrew Lapp	d6be5333e1	fix: sample idx off-by-one error for logit_processors (#1179 ) * fix sample_idx off-by-one error * self._scores is indexed differently, only modify the index within self._input_ids --------- Co-authored-by: Andrew Lapp <andrew@rew.la> Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-13 12:26:07 -05:00
Andrei Betlen	cb791716b4	fix: Always set logits_all = True when using speculative decoding	2024-02-12 16:19:05 -05:00
Andrei	153a0049d9	feat: Generic chatml Function Calling (#957 ) * Add demo notebook * Add initial chat handler * Update OpenAI types * Add generic chatml function calling (wip) * Update chatml generic function calling. * Progress on auto-tool calls * fix streaming functions * Remove print statements * fix: Suppress output from llama.cpp init and grammar creation * Add OpenAI v1 python api compatible chat completion function * Support non-streaming multi-tool calls * Format * Include function_call in response.	2024-02-12 15:56:07 -05:00
Andrei Betlen	4abb8c9386	Merge branch 'main' of github.com:abetlen/llama_cpp_python into main	2024-02-09 13:32:31 -05:00
Andrei Betlen	e16f06e6eb	fix: revert _create_completions.	2024-02-09 02:02:13 -05:00
Andrei Betlen	b5fca911b5	feat: Move tokenizer to own module	2024-02-08 01:08:18 -05:00
Jeffrey Fong	901827013b	feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class (#1078 ) * convert functionary-v1 chat handler to use hf autotokenizer * add hf_tokenizer + integrate functionary-v1.4 prompt template * integrate functionary v2 prompt template * update readme * set up parallel function calling wip * set up parallel function calling * Update README.md * Update README.md * refactor tokenizers * include old functionary handler for backward compatibility * add hf_tokenizer_path in server ModelSettings * convert functionary-v1 chat handler to use hf autotokenizer * add hf_tokenizer + integrate functionary-v1.4 prompt template * integrate functionary v2 prompt template * update readme * set up parallel function calling wip * resolve merge conflict * Update README.md * Update README.md * refactor tokenizers * include old functionary handler for backward compatibility * add hf_tokenizer_path in server ModelSettings * Cleanup PR, fix breaking changes * Use hf_pretrained_model_name_or_path for tokenizer * fix hf tokenizer in streaming * update README * refactor offset mapping --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-07 20:07:03 -05:00
Andrei Betlen	59760c85ed	fix: Use llama_log_callback to avoid suppress_stdout_stderr	2024-02-05 21:52:12 -05:00
Andrei	fb762a6041	Add speculative decoding (#1120 ) * Add draft model param to llama class, implement basic prompt lookup decoding draft model * Use samplingcontext for sampling * Use 1d array * Use draft model for sampling * Fix dumb mistake * Allow for later extensions to the LlamaDraftModel api * Cleanup * Adaptive candidate prediction * Update implementation to match hf transformers * Tuning * Fix bug where last token was not used for ngram prediction * Remove heuristic for num_pred_tokens (no benefit) * fix: n_candidates bug. * Add draft_model_num_pred_tokens server setting * Cleanup * Update README	2024-01-31 14:08:14 -05:00
Andrei	da003d8768	Automatically set chat format from gguf (#1110 ) * Use jinja formatter to load chat format from gguf * Fix off-by-one error in metadata loader * Implement chat format auto-detection	2024-01-29 14:22:23 -05:00
Andrei Betlen	9677a1f2c8	fix: Check order	2024-01-23 22:28:03 -05:00
Andrei Betlen	4d6b2f7b91	fix: format	2024-01-23 22:08:27 -05:00
Phil H	fe5d6ea648	fix: GGUF metadata KV overrides, re #1011 (#1116 ) * kv overrides another attempt * add sentinel element, simplify array population * ensure sentinel element is zeroed	2024-01-23 22:00:38 -05:00


333 commits