llama.cpp/llama_cpp/server
Latest commit: Add speculative decoding (#1120) by Andrei (fb762a6041), 2024-01-31 14:08:14 -05:00

* Add draft model param to llama class, implement basic prompt lookup decoding draft model (a sketch of this mechanism follows the list)
* Use SamplingContext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel API
* Cleanup
* Adaptive candidate prediction
* Update implementation to match HF Transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* Fix n_candidates bug
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
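The draft model added in this commit implements prompt lookup decoding: rather than running a second, smaller LLM, it searches the existing context for an earlier occurrence of the trailing n-gram and proposes the tokens that followed it as draft candidates, which the main model then verifies in a single batch. Below is a minimal sketch of that candidate search over a 1-D NumPy token array, in the spirit of the HF Transformers implementation the commit references; the function name and defaults are illustrative, not necessarily those used in model.py.

    import numpy as np

    def find_candidate_pred_tokens(
        input_ids: np.ndarray,      # 1-D array of all token ids seen so far
        max_ngram_size: int = 3,    # longest trailing n-gram to match (illustrative default)
        num_pred_tokens: int = 10,  # cf. the draft_model_num_pred_tokens setting
    ) -> np.ndarray:
        """Propose draft tokens by matching the trailing n-gram of the
        context against an earlier occurrence of the same n-gram."""
        input_length = input_ids.shape[0]
        # Prefer the longest n-gram, falling back to shorter ones.
        for ngram_size in range(min(max_ngram_size, input_length - 1), 0, -1):
            ngram = input_ids[-ngram_size:]  # includes the last token (see bugfix above)
            # Every window of length ngram_size over the context.
            windows = np.lib.stride_tricks.sliding_window_view(input_ids, ngram_size)
            match_indices = np.nonzero(np.all(windows == ngram, axis=1))[0]
            for idx in match_indices:
                start = idx + ngram_size
                end = min(start + num_pred_tokens, input_length)
                if start < end:  # skips the trivial match against the trailing n-gram itself
                    return input_ids[start:end]
        # No repeated n-gram found: propose no draft tokens this step.
        return np.empty(0, dtype=input_ids.dtype)

Because candidates are copied from the prompt itself, the speed-up is largest on tasks that reuse input text verbatim, such as summarization, extraction, or code editing; per the log above, an adaptive heuristic for num_pred_tokens was tried and then removed for showing no benefit.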
Name         Last commit message                                                  Last commit date
__init__.py  llama_cpp server: app is now importable, still runnable as a module  2023-04-29 11:41:25 -07:00
__main__.py  [Feat] Multi model support (#931)                                     2023-12-22 05:51:25 -05:00
app.py       feat(server): include llama-cpp-python version in openapi spec       2024-01-25 11:23:18 -05:00
cli.py       Fix python3.8 support                                                 2024-01-19 08:17:49 -05:00
errors.py    server: Support none defaulting to infinity for completions (#111)   2023-12-22 14:05:13 -05:00
model.py     Add speculative decoding (#1120)                                      2024-01-31 14:08:14 -05:00
settings.py  Add speculative decoding (#1120)                                      2024-01-31 14:08:14 -05:00
types.py     server: Support none defaulting to infinity for completions (#111)   2023-12-22 14:05:13 -05:00
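model.py and settings.py carry the speculative decoding change: per the commit messages, the Llama class gained a draft_model parameter behind a LlamaDraftModel interface, and the server exposes a draft_model_num_pred_tokens setting. A hedged sketch of the Python-API side follows; the concrete LlamaPromptLookupDecoding class and its module path are assumptions, as only LlamaDraftModel is named in the commits.

    # Assumed usage: only LlamaDraftModel is named in the commits above;
    # LlamaPromptLookupDecoding and its module path are illustrative guesses.
    from llama_cpp import Llama
    from llama_cpp.llama_speculative import LlamaPromptLookupDecoding  # assumed path

    llm = Llama(
        model_path="./models/7B/model.gguf",  # placeholder model path
        draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),  # mirrors the server setting
    )
    # Completions now draft tokens via prompt lookup and verify them with the main model.
    print(llm("def fibonacci(n):", max_tokens=64)["choices"][0]["text"])

The server presumably constructs the equivalent object from its settings, so starting it with draft_model and draft_model_num_pred_tokens set should yield the same behavior over the OpenAI-compatible API.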