baalajimaestro/llama.cpp

Author	SHA1	Message	Date
Andrei Betlen	93dc56ace8	Update llama.cpp	2024-03-06 01:32:00 -05:00
Andrei Betlen	97aa3a153d	docs: Add information re: auto chat formats. Closes #1236	2024-03-01 13:10:25 -05:00
Andrei Betlen	f062a7f51d	feat: Update llama.cpp	2024-03-01 12:57:16 -05:00
Sigbjørn Skjæret	c36ab15e68	fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter (#1230) The token strings were not correctly retrieved (empty).	2024-02-28 01:30:31 -05:00
Andrei Betlen	2292af5796	feat: Update llama.cpp	2024-02-25 16:53:58 -05:00
Andrei Betlen	47bad30dd7	fix: LlamaHFTokenizer now receives pre_tokens	2024-02-23 12:23:24 -05:00
Andrei Betlen	db776a885c	fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8'	2024-02-23 11:24:53 -05:00
Andrei Betlen	e6d6260a91	fix: Update from_pretrained defaults to match hf_hub_download	2024-02-22 00:10:23 -05:00
Andrei	7f51b6071f	feat(low-level-api): Improve API static type-safety and performance (#1205)	2024-02-21 16:25:38 -05:00
Andrei	0f8aa4ab5c	feat: Pull models directly from huggingface (#1206) * Add from_pretrained method to Llama class * Update docs * Merge filename and pattern	2024-02-21 16:25:10 -05:00
Andrei Betlen	53f6f5f415	fix: self.numa missing	2024-02-17 01:02:33 -05:00
Andrei Betlen	fdce078cb9	feat: Update llama.cpp	2024-02-17 00:37:51 -05:00
Andrei Betlen	0ce66bc080	fix: create_embedding broken response for input type str	2024-02-15 16:09:48 -05:00
Douglas Hanley	7bb91f025f	fix: Incorporate embedding pooling layer fixes (#1194) * remove division by token count * truncate to n_batch, not n_ctx	2024-02-15 15:16:30 -05:00
Douglas Hanley	d7a67917ba	feat: Support batch embeddings (#1186) * handle batched embeddings * fix normalization issue * fix type hints, ensure no breaking changes to embed * Clear kv cache / reset internal state after embedding complete --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-14 04:26:09 -05:00
Andrew Lapp	d6be5333e1	fix: sample idx off-by-one error for logit_processors (#1179) * fix sample_idx off-by-one error * self._scores is indexed differently, only modify the index within self._input_ids --------- Co-authored-by: Andrew Lapp <andrew@rew.la> Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-13 12:26:07 -05:00
Andrei Betlen	cb791716b4	fix: Always set logits_all = True when using speculative decoding	2024-02-12 16:19:05 -05:00
Andrei	153a0049d9	feat: Generic chatml Function Calling (#957) * Add demo notebook * Add initial chat handler * Update OpenAI types * Add generic chatml function calling (wip) * Update chatml generic function calling. * Progress on auto-tool calls * fix streaming functions * Remove print statements * fix: Suppress output from llama.cpp init and grammar creation * Add OpenAI v1 python api compatible chat completion function * Support non-streaming multi-tool calls * Format * Include function_call in response.	2024-02-12 15:56:07 -05:00
Andrei Betlen	4abb8c9386	Merge branch 'main' of github.com:abetlen/llama_cpp_python into main	2024-02-09 13:32:31 -05:00
Andrei Betlen	e16f06e6eb	fix: revert _create_completions.	2024-02-09 02:02:13 -05:00
Andrei Betlen	b5fca911b5	feat: Move tokenizer to own module	2024-02-08 01:08:18 -05:00
Jeffrey Fong	901827013b	feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class (#1078 ) * convert functionary-v1 chat handler to use hf autotokenizer * add hf_tokenizer + inteegrate functionary-v1.4 prompt template * integrate functionary v2 prompt template * update readme * set up parallel function calling wip * set up parallel function calling * Update README.md * Update README.md * refactor tokenizers * include old functionary handler for backward compatibility * add hf_tokenizer_path in server ModelSettings * convert functionary-v1 chat handler to use hf autotokenizer * add hf_tokenizer + inteegrate functionary-v1.4 prompt template * integrate functionary v2 prompt template * update readme * set up parallel function calling wip * resolve merge conflict * Update README.md * Update README.md * refactor tokenizers * include old functionary handler for backward compatibility * add hf_tokenizer_path in server ModelSettings * Cleanup PR, fix breaking changes * Use hf_pretrained_model_name_or_path for tokenizer * fix hf tokenizer in streaming * update README * refactor offset mapping --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-02-07 20:07:03 -05:00
Andrei Betlen	59760c85ed	fix: Use llama_log_callback to avoid suppress_stdout_stderr	2024-02-05 21:52:12 -05:00
Andrei	fb762a6041	Add speculative decoding (#1120) * Add draft model param to llama class, implement basic prompt lookup decoding draft model * Use samplingcontext for sampling * Use 1d array * Use draft model for sampling * Fix dumb mistake * Allow for later extensions to the LlamaDraftModel api * Cleanup * Adaptive candidate prediction * Update implementation to match hf transformers * Tuning * Fix bug where last token was not used for ngram prediction * Remove heuristic for num_pred_tokens (no benefit) * fix: n_candidates bug. * Add draft_model_num_pred_tokens server setting * Cleanup * Update README	2024-01-31 14:08:14 -05:00
Andrei	da003d8768	Automatically set chat format from gguf (#1110) * Use jinja formatter to load chat format from gguf * Fix off-by-one error in metadata loader * Implement chat format auto-detection	2024-01-29 14:22:23 -05:00
Andrei Betlen	9677a1f2c8	fix: Check order	2024-01-23 22:28:03 -05:00
Andrei Betlen	4d6b2f7b91	fix: format	2024-01-23 22:08:27 -05:00
Phil H	fe5d6ea648	fix: GGUF metadata KV overrides, re #1011 (#1116) * kv overrides another attempt * add sentinel element, simplify array population * ensure sentinel element is zeroed	2024-01-23 22:00:38 -05:00
Andrei Betlen	5a34c57e54	feat: Expose gguf model metadata in metadata property	2024-01-19 10:46:03 -05:00
Andrei Betlen	3babe3512c	Fix mirostat sampling	2024-01-19 08:31:59 -05:00
Andrei Betlen	48c3b77e6f	Offload KQV by default	2024-01-18 11:08:57 -05:00
Andrei Betlen	7b46bb5a78	Re-order classes in llama.py	2024-01-17 09:16:13 -05:00
Andrei Betlen	cc4630e66f	Move helper classes to _internals submodule	2024-01-17 09:14:00 -05:00
Andrei Betlen	3b92419132	Move cache classes to llama_cache submodule.	2024-01-17 09:09:12 -05:00
Andrei Betlen	84615adbc6	Add split_mode option. Closes #1085	2024-01-15 12:49:20 -05:00
Phil H	76aafa6149	Implement GGUF metadata KV overrides (#1011) * Implement GGUF metadata overrides * whitespace fix * Fix kv overrides. * Fix pointer and pickle * Match llama.cpp kv_overrides cli argument --------- Co-authored-by: Andrei <abetlen@gmail.com>	2024-01-15 12:29:29 -05:00
Stephen Hankinson	df3be58d6c	Add ability to pass in penalize_nl param (#1068)	2024-01-10 02:46:27 -05:00
Andrei Betlen	d9a1d90fd7	Fix typo	2023-12-22 15:12:27 -05:00
swg	4b01a873ef	server: Support none defaulting to infinity for completions (#111) * Support defaulting to infinity or -1 for chat completions * Check if completion_tokens is none in error handler. * fix: max_tokens in create completion should match openai spec * Fix __call__ --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2023-12-22 14:05:13 -05:00
twaka	2f03fb0231	fix text_offset of multi-token characters (#1037) * fix text_offsets for bytes tokens * fix	2023-12-22 00:03:29 -05:00
Andrei Betlen	a05b4da80a	fix: float32 is not JSON serializable when streaming logits.	2023-12-18 18:40:36 -05:00
Andrei Betlen	095c650006	Add offload_kqv option to llama and server	2023-12-18 15:36:09 -05:00
Andrei Betlen	472b344ae3	Remove unnused import	2023-12-18 15:32:40 -05:00
kddubey	6b2e0e05b4	perf: Don't convert logprobs arrays to lists (#1021)	2023-12-18 14:28:12 -05:00
Brandon Roberts	62944df142	Bugfix: Remove f16_kv, add offload_kqv field (#1019) F16_KV appears to have been removed here: `af99c6fbfc` This addresses two issues: - #995 which just requests to add the KV cache offloading param - #1006 a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)	2023-12-18 14:27:11 -05:00
Daniele Morotti	f1c631dc53	Bug fixed with n_ctx=0 (#1015) If the n_ctx is set to 0 the code should use the maximum context length of the selected model, but it didn't work. There was a problem with the initialization of this parameter and a related problem with 'n_batch'.	2023-12-16 18:59:50 -05:00
kddubey	5a8944672f	Fix logits_to_logprobs for 2-D and 3-D logits (#1002) * Fix logits_to_logprobs for 2-D and 3-D logits * Set dtype to single * Test size	2023-12-16 18:59:26 -05:00
Tanner Hobson	ef22e478db	Replace logits_to_logprobs implementation with numpy equivalent to llama.cpp (#991) See #990. This change makes the logits_to_logprobs function equivalent to the version in the llama.cpp repository. It uses numpy so it's much faster than the previous version.	2023-12-11 20:46:27 -05:00
Andrei Betlen	ec26f364cc	Remove f16_kv	2023-12-11 10:25:37 -05:00
kddubey	b069d06346	Fix #891 (#952)	2023-11-29 05:39:52 -05:00

311 commits