Andrei Betlen
e4647c75ec
Add use_mmap flag to server
2023-04-19 15:57:46 -04:00
Andrei Betlen
92c077136d
Add experimental cache
2023-04-15 12:03:09 -04:00
Andrei Betlen
6c7cec0c65
Fix completion request
2023-04-14 10:01:15 -04:00
Andrei Betlen
4f5f99ef2a
Formatting
2023-04-12 22:40:12 -04:00
Andrei Betlen
0daf16defc
Enable logprobs on completion endpoint
2023-04-12 19:08:11 -04:00
Andrei Betlen
19598ac4e8
Fix threading bug. Closes #62
2023-04-12 19:07:53 -04:00
Andrei Betlen
b3805bb9cc
Implement logprobs parameter for text completion. Closes #2
2023-04-12 14:05:11 -04:00
Andrei Betlen
213cc5c340
Remove async from function signature to avoid blocking the server
2023-04-11 11:54:31 -04:00
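The commit above drops `async` so the blocking llama.cpp call no longer stalls the event loop. A minimal sketch of the idea, assuming a FastAPI route wrapping a blocking generation call; the path, route, and helper names are illustrative, not the package's actual identifiers:

```python
# Minimal sketch of the change described above, not the package's actual code.
# In FastAPI, an `async def` route runs on the event loop, so a long,
# CPU-bound llama.cpp call inside it blocks every other request until it
# finishes. A plain `def` route is run in FastAPI's thread pool instead,
# keeping the server responsive.
from fastapi import FastAPI

app = FastAPI()


def blocking_generate(prompt: str) -> str:
    # Stand-in for a CPU-bound llama.cpp completion call.
    return prompt[::-1]


# Before (blocks the event loop while generating):
# @app.post("/v1/completions")
# async def create_completion(prompt: str):
#     return {"text": blocking_generate(prompt)}

# After (runs in the thread pool; other requests keep flowing):
@app.post("/v1/completions")
def create_completion(prompt: str):
    return {"text": blocking_generate(prompt)}
```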
Andrei Betlen
0067c1a588
Formatting
2023-04-08 16:01:18 -04:00
Andrei Betlen
da539cc2ee
Safer calculation of default n_threads
2023-04-06 21:22:19 -04:00
Andrei Betlen
930db37dd2
Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2023-04-06 21:07:38 -04:00
Andrei Betlen
55279b679d
Handle prompt list
2023-04-06 21:07:35 -04:00
MillionthOdin16
c283edd7f2
Set n_batch to default values and reduce thread count:
Change the batch size to the llama.cpp default of 8. I've seen issues in llama.cpp where batch size affects the quality of generations (it shouldn't), but I changed it to the default in case that's still an issue.
Set the auto-determined number of threads to half the system count. ggml will sometimes lock cores at 100% while doing nothing; this is being addressed, but it can make for a bad user experience if cores are pegged at 100%.
2023-04-05 18:17:29 -04:00
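A minimal sketch of the defaults described in the message above; it assumes nothing about the package's actual settings code, and the environment variable names are illustrative:

```python
# Minimal sketch (assumed, not the package's actual code) of the defaults
# described above: n_batch falls back to llama.cpp's default of 8, and the
# auto-detected thread count uses half the logical cores so ggml does not
# peg every core at 100%.
import multiprocessing
import os

N_BATCH_DEFAULT = 8  # llama.cpp default batch size


def default_n_threads() -> int:
    # Half the logical cores, but never fewer than one thread.
    return max(1, multiprocessing.cpu_count() // 2)


n_batch = int(os.environ.get("N_BATCH", N_BATCH_DEFAULT))
n_threads = int(os.environ.get("N_THREADS", default_n_threads()))
print(f"n_batch={n_batch} n_threads={n_threads}")
```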
MillionthOdin16
76a82babef
Set n_batch to the default value of 8. I think this is leftover from when n_ctx was missing and n_batch was 2048.
2023-04-05 17:44:53 -04:00
Andrei Betlen
44448fb3a8
Add server as a subpackage
2023-04-05 16:23:25 -04:00