Commit graph

141 commits

Author SHA1 Message Date
Andrei Betlen
a0b61ea2a7 Bugfix for models endpoint 2023-05-07 20:17:52 -04:00
Andrei Betlen
14da46f16e Added cache size to settins object. 2023-05-07 19:33:17 -04:00
Andrei Betlen
627811ea83 Add verbose flag to server 2023-05-07 05:09:10 -04:00
Andrei Betlen
3fbda71790 Fix mlock_supported and mmap_supported return type 2023-05-07 03:04:22 -04:00
Andrei Betlen
5a3413eee3 Update cpu_count 2023-05-07 03:03:57 -04:00
Andrei Betlen
1a00e452ea Update settings fields and defaults 2023-05-07 02:52:20 -04:00
Andrei Betlen
86753976c4 Revert "llama_cpp server: delete some ignored / unused parameters"
This reverts commit b47b9549d5.
2023-05-07 02:02:34 -04:00
Andrei Betlen
c382d8f86a Revert "llama_cpp server: mark model as required"
This reverts commit e40fcb0575.
2023-05-07 02:00:22 -04:00
Andrei Betlen
d8fddcce73 Merge branch 'main' of github.com:abetlen/llama_cpp_python into better-server-params-and-fields 2023-05-07 01:54:00 -04:00
Andrei Betlen
24fc38754b Add cli options to server. Closes #37 2023-05-05 12:08:28 -04:00
Lucas Doyle
b9098b0ef7 llama_cpp server: prompt is a string
Not sure why this union type was here but taking a look at llama.py, prompt is only ever processed as a string for completion

This was breaking types when generating an openapi client
2023-05-02 14:47:07 -07:00
Andrei
7ab08b8d10
Merge branch 'main' into better-server-params-and-fields 2023-05-01 22:45:57 -04:00
Andrei Betlen
9eafc4c49a Refactor server to use factory 2023-05-01 22:38:46 -04:00
Lucas Doyle
dbbfc4ba2f llama_cpp server: fix to ChatCompletionRequestMessage
When I generate a client, it breaks because it fails to process the schema of ChatCompletionRequestMessage

These fix that:
- I think `Union[Literal["user"], Literal["channel"], ...]` is the same as Literal["user", "channel", ...]
- Turns out default value `Literal["user"]` isn't JSON serializable, so replace with "user"
2023-05-01 15:38:19 -07:00
Lucas Doyle
fa2a61e065 llama_cpp server: fields for the embedding endpoint 2023-05-01 15:38:19 -07:00
Lucas Doyle
8dcbf65a45 llama_cpp server: define fields for chat completions
Slight refactor for common fields shared between completion and chat completion
2023-05-01 15:38:19 -07:00
Lucas Doyle
978b6daf93 llama_cpp server: add some more information to fields for completions 2023-05-01 15:38:19 -07:00
Lucas Doyle
a5aa6c1478 llama_cpp server: add missing top_k param to CreateChatCompletionRequest
`llama.create_chat_completion` definitely has a `top_k` argument, but its missing from `CreateChatCompletionRequest`. decision: add it
2023-05-01 15:38:19 -07:00
Lucas Doyle
1e42913599 llama_cpp server: move logprobs to supported
I think this is actually supported (its in the arguments of `LLama.__call__`, which is how the completion is invoked). decision: mark as supported
2023-05-01 15:38:19 -07:00
Lucas Doyle
b47b9549d5 llama_cpp server: delete some ignored / unused parameters
`n`, `presence_penalty`, `frequency_penalty`, `best_of`, `logit_bias`, `user`: not supported, excluded from the calls into llama. decision: delete it
2023-05-01 15:38:19 -07:00
Lucas Doyle
e40fcb0575 llama_cpp server: mark model as required
`model` is ignored, but currently marked "optional"... on the one hand could mark "required" to make it explicit in case the server supports multiple llama's at the same time, but also could delete it since its ignored. decision: mark it required for the sake of openai api compatibility.

I think out of all parameters, `model` is probably the most important one for people to keep using even if its ignored for now.
2023-05-01 15:38:19 -07:00
Andrei Betlen
9ff9cdd7fc Fix import error 2023-05-01 15:11:15 -04:00
Lucas Doyle
efe8e6f879 llama_cpp server: slight refactor to init_llama function
Define an init_llama function that starts llama with supplied settings instead of just doing it in the global context of app.py

This allows the test to be less brittle by not needing to mess with os.environ, then importing the app
2023-04-29 11:42:23 -07:00
Lucas Doyle
6d8db9d017 tests: simple test for server module 2023-04-29 11:42:20 -07:00
Lucas Doyle
468377b0e2 llama_cpp server: app is now importable, still runnable as a module 2023-04-29 11:41:25 -07:00
Andrei Betlen
3cab3ef4cb Update n_batch for server 2023-04-25 09:11:32 -04:00
Andrei Betlen
e4647c75ec Add use_mmap flag to server 2023-04-19 15:57:46 -04:00
Andrei Betlen
92c077136d Add experimental cache 2023-04-15 12:03:09 -04:00
Andrei Betlen
6c7cec0c65 Fix completion request 2023-04-14 10:01:15 -04:00
Andrei Betlen
4f5f99ef2a Formatting 2023-04-12 22:40:12 -04:00
Andrei Betlen
0daf16defc Enable logprobs on completion endpoint 2023-04-12 19:08:11 -04:00
Andrei Betlen
19598ac4e8 Fix threading bug. Closes #62 2023-04-12 19:07:53 -04:00
Andrei Betlen
b3805bb9cc Implement logprobs parameter for text completion. Closes #2 2023-04-12 14:05:11 -04:00
Andrei Betlen
213cc5c340 Remove async from function signature to avoid blocking the server 2023-04-11 11:54:31 -04:00
Andrei Betlen
0067c1a588 Formatting 2023-04-08 16:01:18 -04:00
Andrei Betlen
da539cc2ee Safer calculation of default n_threads 2023-04-06 21:22:19 -04:00
Andrei Betlen
930db37dd2 Merge branch 'main' of github.com:abetlen/llama_cpp_python into main 2023-04-06 21:07:38 -04:00
Andrei Betlen
55279b679d Handle prompt list 2023-04-06 21:07:35 -04:00
MillionthOdin16
c283edd7f2 Set n_batch to default values and reduce thread count:
Change batch size to the llama.cpp default of 8. I've seen issues in llama.cpp where batch size affects quality of generations. (It shouldn't) But in case that's still an issue I changed to default.

Set auto-determined num of threads to 1/2 system count. ggml will sometimes lock cores at 100% while doing nothing. This is being addressed, but can cause bad experience for user if pegged at 100%
2023-04-05 18:17:29 -04:00
MillionthOdin16
76a82babef Set n_batch to the default value of 8. I think this is leftover from when n_ctx was missing and n_batch was 2048. 2023-04-05 17:44:53 -04:00
Andrei Betlen
44448fb3a8 Add server as a subpackage 2023-04-05 16:23:25 -04:00