Andrei Betlen
0318702cdc
feat(server): Add support for setting root_path. Closes #1420
2024-05-05 12:49:31 -04:00
Andrei Betlen
f9b7221c8f
Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2024-05-03 19:07:54 -04:00
Andrei Betlen
0a454bebe6
feat(server): Remove temperature bounds checks for server. Closes #1384
2024-05-03 15:23:06 -04:00
Daniel Thuerck
2138561fab
fix(server): Propagate flash_attn to model load. ( #1424 )
2024-05-03 12:17:07 -04:00
Andrei Betlen
31b1d95a6c
feat: Add llama-3-vision-alpha chat format
2024-05-02 11:32:18 -04:00
Andrei Betlen
22d77eefd2
feat: Add option to enable flash_attn to Lllama params and ModelSettings
2024-04-30 09:29:16 -04:00
Andrei
fe2da09538
feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) ( #1147 )
...
* Test dummy image tags in chat templates
* Format and improve types for llava_cpp.py
* Add from_pretrained support to llava chat format.
* Refactor llava chat format to use a jinja2
* Revert chat format test
* Add moondream support (wip)
* Update moondream chat format
* Update moondream chat format
* Update moondream prompt
* Add function calling support
* Cache last image embed
* Add Llava1.6 support
* Add nanollava support
* Add obisidian support
* Remove unnecessary import
* Re-order multimodal chat formats
* Logits all no longer required for multi-modal models
* Update README.md
* Update docs
* Update README
* Fix typo
* Update README
* Fix typo
2024-04-30 01:35:38 -04:00
Andrei Betlen
fcfea66857
fix: pydantic deprecation warning
2024-04-25 21:21:48 -04:00
Sean Bailey
53ebcc8bb5
feat(server): Provide ability to dynamically allocate all threads if desired using -1 ( #1364 )
2024-04-23 02:35:38 -04:00
khimaros
b73c73c0c6
feat: add disable_ping_events flag ( #1257 )
...
for backward compatibility, this is false by default
it can be set to true to disable EventSource pings
which are not supported by some OpenAI clients.
fixes https://github.com/abetlen/llama-cpp-python/issues/1256
2024-04-17 10:08:19 -04:00
ddh0
c96b2daebf
feat: Use all available CPUs for batch processing ( #1345 )
2024-04-17 10:05:54 -04:00
Andrei Betlen
060bfa64d5
feat: Add support for yaml based configs
2024-04-10 02:47:01 -04:00
Limour
f165048a69
feat: add support for KV cache quantization options ( #1307 )
...
* add KV cache quantization options
https://github.com/abetlen/llama-cpp-python/discussions/1220
https://github.com/abetlen/llama-cpp-python/issues/1305
* Add ggml_type
* Use ggml_type instead of string for quantization
* Add server support
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-04-01 10:19:28 -04:00
windspirit95
aa9f1ae011
feat: Add logprobs support to chat completions ( #1311 )
...
* Add logprobs return in ChatCompletionResponse
* Fix duplicate field
* Set default to false
* Simplify check
* Add server example
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-03-31 13:30:13 -04:00
Andrei Betlen
d11ccc3036
fix(server): minor type fixes
2024-03-23 17:14:15 -04:00
Andrei Betlen
f7decc9562
docs: Add chat examples to openapi ui
2024-03-19 10:52:53 -04:00
Felipe Lorenz
c139f8b5d5
feat: Add endpoints for tokenize, detokenize and count tokens ( #1136 )
...
* Add endpoint to count tokens
* Add tokenize and detokenize endpoints
* Change response key to tokens for tokenize endpoint
* Fix dependency bug
* Cleanup
* Remove example added by mistake
* Move tokenize, detokenize, and count to Extras namespace. Tag existing endpoints
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-03-08 21:09:00 -05:00
Andrei Betlen
727d60c28a
misc: Format
2024-02-28 14:27:40 -05:00
Andrei Betlen
0d37ce52b1
feat: Update llama.cpp
2024-02-28 14:27:16 -05:00
Andrei
4d574bd765
feat(server): Add support for pulling models from Huggingface Hub ( #1222 )
...
* Basic support for hf pull on server
* Add hf_model_repo_id setting
* Update README
2024-02-26 14:35:08 -05:00
Andrei Betlen
dcf38f6141
fix: remove prematurely commited change
2024-02-25 21:00:37 -05:00
Andrei Betlen
2292af5796
feat: Update llama.cpp
2024-02-25 16:53:58 -05:00
Andrei Betlen
fdce078cb9
feat: Update llama.cpp
2024-02-17 00:37:51 -05:00
khimaros
ea1f88dd29
fix: Use '\n' seperator for EventSourceResponse ( #1188 )
...
this fixes compatibility with some OpenAI clients, including BetterChatGPT (https://github.com/ztjhz/BetterChatGPT/issues/537 ).
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-15 15:20:13 -05:00
Andrei Betlen
85d3374b4d
fix: broken import
2024-02-08 01:13:28 -05:00
Jeffrey Fong
901827013b
feat: Integrate functionary v1.4 and v2 models + add custom tokenizer support to Llama class ( #1078 )
...
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + inteegrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* set up parallel function calling
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* convert functionary-v1 chat handler to use hf autotokenizer
* add hf_tokenizer + inteegrate functionary-v1.4 prompt template
* integrate functionary v2 prompt template
* update readme
* set up parallel function calling wip
* resolve merge conflict
* Update README.md
* Update README.md
* refactor tokenizers
* include old functionary handler for backward compatibility
* add hf_tokenizer_path in server ModelSettings
* Cleanup PR, fix breaking changes
* Use hf_pretrained_model_name_or_path for tokenizer
* fix hf tokenizer in streaming
* update README
* refactor offset mapping
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-02-07 20:07:03 -05:00
Andrei
fb762a6041
Add speculative decoding ( #1120 )
...
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use samplingcontext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
2024-01-31 14:08:14 -05:00
Andrei
da003d8768
Automatically set chat format from gguf ( #1110 )
...
* Use jinja formatter to load chat format from gguf
* Fix off-by-one error in metadata loader
* Implement chat format auto-detection
2024-01-29 14:22:23 -05:00
Andrei Betlen
cde7514c3d
feat(server): include llama-cpp-python version in openapi spec
2024-01-25 11:23:18 -05:00
Andrei Betlen
24f39454e9
fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats.
2024-01-21 18:38:04 -05:00
Andrei Betlen
141293a75b
Fix python3.8 support
2024-01-19 08:17:49 -05:00
Andrei Betlen
b8fc1c7d83
feat: Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files.
2024-01-18 21:21:37 -05:00
Andrei Betlen
48c3b77e6f
Offload KQV by default
2024-01-18 11:08:57 -05:00
Kyle Mistele
9c36688b33
fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 ( #1093 )
2024-01-16 18:54:06 -05:00
anil
cfb7da98ed
Support Accept text/event-stream in chat and completion endpoints, resolves #1083 ( #1088 )
...
Co-authored-by: Anil Pathak <anil@heyday.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-01-16 12:52:52 -05:00
Andrei Betlen
84615adbc6
Add split_mode option. Closes #1085
2024-01-15 12:49:20 -05:00
Phil H
76aafa6149
Implement GGUF metadata KV overrides ( #1011 )
...
* Implement GGUF metadata overrides
* whitespace fix
* Fix kv overrides.
* Fix pointer and pickle
* Match llama.cpp kv_overrides cli argument
---------
Co-authored-by: Andrei <abetlen@gmail.com>
2024-01-15 12:29:29 -05:00
Andrei Betlen
522aecb868
docs: add server config docs
2023-12-22 14:37:24 -05:00
swg
4b01a873ef
server: Support none defaulting to infinity for completions ( #111 )
...
* Support defaulting to infinity or -1 for chat completions
* Check if completion_tokens is none in error handler.
* fix: max_tokens in create completion should match openai spec
* Fix __call__
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2023-12-22 14:05:13 -05:00
Dave
12b7f2f4e9
[Feat] Multi model support ( #931 )
...
* Update Llama class to handle chat_format & caching
* Add settings.py
* Add util.py & update __main__.py
* multimodel
* update settings.py
* cleanup
* delete util.py
* Fix /v1/models endpoint
* MultiLlama now iterable, app check-alive on "/"
* instant model init if file is given
* backward compability
* revert model param mandatory
* fix error
* handle individual model config json
* refactor
* revert chathandler/clip_model changes
* handle chat_handler in MulitLlama()
* split settings into server/llama
* reduce global vars
* Update LlamaProxy to handle config files
* Add free method to LlamaProxy
* update arg parsers & install server alias
* refactor cache settings
* change server executable name
* better var name
* whitespace
* Revert "whitespace"
This reverts commit bc5cf51c64a95bfc9926e1bc58166059711a1cd8.
* remove exe_name
* Fix merge bugs
* Fix type annotations
* Fix type annotations
* Fix uvicorn app factory
* Fix settings
* Refactor server
* Remove formatting fix
* Format
* Use default model if not found in model settings
* Fix
* Cleanup
* Fix
* Fix
* Remove unnused CommandLineSettings
* Cleanup
* Support default name for copilot-codex models
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2023-12-22 05:51:25 -05:00
docmeth02
33cc623346
Implement openai api compatible authentication ( #1010 )
2023-12-21 13:44:49 -05:00
Andrei Betlen
095c650006
Add offload_kqv option to llama and server
2023-12-18 15:36:09 -05:00
Brandon Roberts
62944df142
Bugfix: Remove f16_kv, add offload_kqv field ( #1019 )
...
F16_KV appears to have been removed here: af99c6fbfc
This addresses two issues:
- #995 which just requests to add the KV cache offloading param
- #1006 a NULL ptr exception when using the embeddings (introduced by
leaving f16_kv in the fields struct)
2023-12-18 14:27:11 -05:00
Radoslav Gerganov
8e44a32075
Add support for running the server with SSL ( #994 )
2023-12-11 20:47:11 -05:00
Andrei Betlen
1a7bf2037b
docs: Update openapi endpoint names
2023-11-24 03:39:29 -05:00
Andrei Betlen
128dc4731f
Fix #569
2023-11-21 04:39:05 -05:00
Andrei Betlen
7a3f87846b
Format
2023-11-21 04:02:20 -05:00
Andrei Betlen
07e47f55ba
Add support for logit_bias outside of server api. Closes #827
2023-11-21 03:59:46 -05:00
TK-Master
b8438f70b5
Added support for min_p ( #921 )
...
* Added support for min_p
My small contribution to this great project.
Ref: https://github.com/ggerganov/llama.cpp/pull/3841
Closes: https://github.com/abetlen/llama-cpp-python/issues/911
* Fix for negative temp (sample_softmax)
2023-11-20 23:21:33 -05:00
Andrei Betlen
e7962d2c73
Fix: default max_tokens matches openai api (16 for completion, max length for chat completion)
2023-11-10 02:49:27 -05:00