Commit graph

60 commits

Author SHA1 Message Date
Bruce MacDonald
09dd2aeff9
GGUF support (#441) 2023-09-07 13:55:37 -04:00
Bruce MacDonald
42998d797d
subprocess llama.cpp server (#401)
* remove c code
* pack llama.cpp
* use request context for llama_cpp
* let llama_cpp decide the number of threads to use
* stop llama runner when app stops
* remove sample count and duration metrics
* use go generate to get libraries
* tmp dir for running llm
2023-08-30 16:35:03 -04:00
Michael Yang
b25dd1795d allow F16 to use metal
warning F16 uses significantly more memory than quantized model so the
standard requires don't apply.
2023-08-26 08:38:48 -07:00
Michael Yang
304f2b6c96 add 34b to mem check 2023-08-26 08:29:21 -07:00
Michael Yang
a894cc792d model and file type as strings 2023-08-17 12:08:04 -07:00
Michael Yang
e26085b921 close open files 2023-08-14 16:08:06 -07:00
Michael Yang
6de5d032e1 implement loading ggml lora adapters through the modelfile 2023-08-10 09:23:39 -07:00
Michael Yang
d791df75dd check memory requirements before loading 2023-08-10 09:23:11 -07:00
Michael Yang
020a3b3530 disable gpu for q5_0, q5_1, q8_0 quants 2023-08-10 09:23:11 -07:00
Michael Yang
fccf8d179f partial decode ggml bin for more info 2023-08-10 09:23:10 -07:00