llama.cpp/llama_cpp/server
Latest commit: Add speculative decoding (#1120) by Andrei (fb762a6041), 2024-01-31 14:08:14 -05:00

* Add draft model param to llama class, implement basic prompt lookup decoding draft model (a sketch of this mechanism follows the list)
* Use SamplingContext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel API
* Cleanup
* Adaptive candidate prediction
* Update implementation to match HF Transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* Fix n_candidates bug
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
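The draft model added in this commit implements prompt lookup decoding: rather than running a second, smaller LLM, it searches the existing context for an earlier occurrence of the trailing n-gram and proposes the tokens that followed it as draft candidates, which the main model then verifies in a single batch. Below is a minimal sketch of that candidate search over a 1-D NumPy token array, in the spirit of the HF Transformers implementation the commit references; the function name and defaults are illustrative, not necessarily those used in model.py.

    import numpy as np

    def find_candidate_pred_tokens(
        input_ids: np.ndarray,      # 1-D array of all token ids seen so far
        max_ngram_size: int = 3,    # longest trailing n-gram to match (illustrative default)
        num_pred_tokens: int = 10,  # cf. the draft_model_num_pred_tokens setting
    ) -> np.ndarray:
        """Propose draft tokens by matching the trailing n-gram of the
        context against an earlier occurrence of the same n-gram."""
        input_length = input_ids.shape[0]
        # Prefer the longest n-gram, falling back to shorter ones.
        for ngram_size in range(min(max_ngram_size, input_length - 1), 0, -1):
            ngram = input_ids[-ngram_size:]  # includes the last token (see bugfix above)
            # Every window of length ngram_size over the context.
            windows = np.lib.stride_tricks.sliding_window_view(input_ids, ngram_size)
            match_indices = np.nonzero(np.all(windows == ngram, axis=1))[0]
            for idx in match_indices:
                start = idx + ngram_size
                end = min(start + num_pred_tokens, input_length)
                if start < end:  # skips the trivial match against the trailing n-gram itself
                    return input_ids[start:end]
        # No repeated n-gram found: propose no draft tokens this step.
        return np.empty(0, dtype=input_ids.dtype)

Because candidates are copied from the prompt itself, the speed-up is largest on tasks that reuse input text verbatim, such as summarization, extraction, or code editing; per the log above, an adaptive heuristic for num_pred_tokens was tried and then removed for showing no benefit.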
Name         Last commit message                                                  Last commit date
__init__.py  llama_cpp server: app is now importable, still runnable as a module  2023-04-29 11:41:25 -07:00
__main__.py  [Feat] Multi model support (#931)                                     2023-12-22 05:51:25 -05:00
app.py       feat(server): include llama-cpp-python version in openapi spec       2024-01-25 11:23:18 -05:00
cli.py       Fix python3.8 support                                                 2024-01-19 08:17:49 -05:00
errors.py    server: Support none defaulting to infinity for completions (#111)   2023-12-22 14:05:13 -05:00
model.py     Add speculative decoding (#1120)                                      2024-01-31 14:08:14 -05:00
settings.py  Add speculative decoding (#1120)                                      2024-01-31 14:08:14 -05:00
types.py     server: Support none defaulting to infinity for completions (#111)   2023-12-22 14:05:13 -05:00
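model.py and settings.py carry the speculative decoding change: per the commit messages, the Llama class gained a draft_model parameter behind a LlamaDraftModel interface, and the server exposes a draft_model_num_pred_tokens setting. A hedged sketch of the Python-API side follows; the concrete LlamaPromptLookupDecoding class and its module path are assumptions, as only LlamaDraftModel is named in the commits.

    # Assumed usage: only LlamaDraftModel is named in the commits above;
    # LlamaPromptLookupDecoding and its module path are illustrative guesses.
    from llama_cpp import Llama
    from llama_cpp.llama_speculative import LlamaPromptLookupDecoding  # assumed path

    llm = Llama(
        model_path="./models/7B/model.gguf",  # placeholder model path
        draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),  # mirrors the server setting
    )
    # Completions now draft tokens via prompt lookup and verify them with the main model.
    print(llm("def fibonacci(n):", max_tokens=64)["choices"][0]["text"])

The server presumably constructs the equivalent object from its settings, so starting it with draft_model and draft_model_num_pred_tokens set should yield the same behavior over the OpenAI-compatible API.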