* Support SPM infill
* typo--
* one less layer of parentheses necessary
* new required internals
* manually add bos/eos if model requires it
* add bos even when unknown
This is identical behaviour to llama.cpp
I guess any model that doesn't use BOS is recent enough to have the add_bos_token metadata.
* don't add bos/eos on non-infill pre-tokenized prompt
* add tokenizer hack to remove leading space in suffix
* I keep forgetting metadata are strings
* check if bos exists
* add example
* add cls/sep instead of bos/eos for WPM vocab
* simplify
* color-code filtered suffix
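
The BOS/EOS bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this change: the helper names (`metadata_flag`, `wrap_tokens`) are hypothetical, the metadata keys are assumed to be `tokenizer.ggml.add_bos_token` and `tokenizer.ggml.add_eos_token`, and a negative token id is assumed to mean "token does not exist".

```python
from typing import Dict, List


def metadata_flag(metadata: Dict[str, str], key: str, default: bool) -> bool:
    """Read a boolean flag from metadata whose values are strings.

    A plain truthiness check would treat the string "false" as True,
    so the value is compared against "true" explicitly.
    """
    value = metadata.get(key)
    if value is None:
        return default
    return value.strip().lower() == "true"


def wrap_tokens(
    tokens: List[int],
    metadata: Dict[str, str],
    bos_id: int,
    eos_id: int,
) -> List[int]:
    """Add BOS/EOS around a token list if the model asks for it."""
    # Assumed key names; BOS defaults to True when the flag is unknown.
    add_bos = metadata_flag(metadata, "tokenizer.ggml.add_bos_token", True)
    add_eos = metadata_flag(metadata, "tokenizer.ggml.add_eos_token", False)

    out = list(tokens)
    # Assumption: a negative id means the vocab has no such token.
    if add_bos and bos_id >= 0:
        out.insert(0, bos_id)
    if add_eos and eos_id >= 0:
        out.append(eos_id)
    return out
```

With no metadata at all, `wrap_tokens([5, 6], {}, 1, 2)` prepends BOS only, which matches the "add bos even when unknown" bullet.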
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
This commit "deprecates" the example FastAPI server: it remains runnable, but points folks at the module if they want to learn more.
Rationale:
Currently there exist two server implementations in this repo:
- `llama_cpp/server/__main__.py`, the module that's runnable by consumers of the library with `python3 -m llama_cpp.server`
- `examples/high_level_api/fastapi_server.py`, which is probably a copy-pasted example by folks hacking around
IMO this is confusing. As a new user of the library I see they've both been updated relatively recently, but comparing them side by side shows they've diverged.
The one in the module seems better:
- supports logits_all
- supports use_mmap
- has experimental cache support (with some mutex thing going on)
- some stuff with streaming support was moved around more recently than fastapi_server.py
Change batch size to the llama.cpp default of 8. I've seen issues in llama.cpp where batch size affects the quality of generations. It shouldn't, but in case that's still an issue I changed it to the default.
Set the auto-determined number of threads to half the system CPU count. ggml will sometimes lock cores at 100% while doing nothing. This is being addressed, but it can cause a bad experience for the user if every core is pegged at 100%.
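
A minimal sketch of that thread default, assuming the count comes from Python's `multiprocessing.cpu_count()`. The helper name `default_n_threads` is hypothetical, and the floor of 1 is an added guard so single-core machines still get one thread.

```python
import multiprocessing


def default_n_threads() -> int:
    """Return half the reported CPU count, but never fewer than one thread."""
    # Integer division rounds down, e.g. 8 CPUs -> 4 threads, 1 CPU -> 0,
    # so max() keeps the result at 1 or above.
    return max(multiprocessing.cpu_count() // 2, 1)
```

A caller that wants a different value can still pass its own thread count explicitly; the helper only covers the auto-determined case.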