baalajimaestro/llama.cpp

Author	SHA1	Message	Date
Sigbjørn Skjæret	dbcf64cf07	feat: Support SPM infill (#1492 ) * Support SPM infill * typo-- * one less layer of parenthesis necessary * new required internals * manually add bos/eos if model requires it * add bos even when unknown This is identical behaviour to llama.cpp I guess any model that doesn't use BOS is recent enough to have the add_bos_token metadata. * don't add bos/eos on non-infill pre-tokenized prompt * add tokenizer hack to remove leading space in suffix * I keep forgetting metadata are strings * check if bos exists * add example * add cls/sep instead of bos/eos for WPM vocab * simplify * color-code filtered suffix --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com>	2024-06-13 03:45:24 -04:00
Andrei Betlen	2b37d8e438	fix: Run server command. Closes #1143	2024-01-31 10:37:19 -05:00
Lucas Doyle	0fcc25cdac	examples fastapi_server: deprecate This commit "deprecates" the example fastapi server by remaining runnable but pointing folks at the module if they want to learn more. Rationale: Currently there exist two server implementations in this repo: - `llama_cpp/server/__main__.py`, the module that's runnable by consumers of the library with `python3 -m llama_cpp.server` - `examples/high_level_api/fastapi_server.py`, which is probably a copy-pasted example by folks hacking around IMO this is confusing. As a new user of the library I see they've both been updated relatively recently but looking side-by-side there's a diff. The one in the module seems better: - supports logits_all - supports use_mmap - has experimental cache support (with some mutex thing going on) - some stuff with streaming support was moved around more recently than fastapi_server.py	2023-05-01 22:34:23 -07:00
Andrei Betlen	196650ccb2	Update model paths to be more clear they should point to file	2023-04-09 22:45:55 -04:00
MillionthOdin16	c283edd7f2	Set n_batch to default values and reduce thread count: Change batch size to the llama.cpp default of 8. I've seen issues in llama.cpp where batch size affects quality of generations. (It shouldn't) But in case that's still an issue I changed to default. Set auto-determined num of threads to 1/2 system count. ggml will sometimes lock cores at 100% while doing nothing. This is being addressed, but can cause bad experience for user if pegged at 100%	2023-04-05 18:17:29 -04:00
Andrei Betlen	e1b5b9bb04	Update fastapi server example	2023-04-05 14:44:26 -04:00
Andrei Betlen	c8e13a78d0	Re-organize examples folder	2023-04-05 04:10:13 -04:00

7 commits