Bruce MacDonald
09dd2aeff9
GGUF support (#441)
2023-09-07 13:55:37 -04:00
Bruce MacDonald
42998d797d
subprocess llama.cpp server (#401)
* remove c code
* pack llama.cpp
* use request context for llama_cpp
* let llama_cpp decide the number of threads to use
* stop llama runner when app stops
* remove sample count and duration metrics
* use go generate to get libraries
* tmp dir for running llm
2023-08-30 16:35:03 -04:00
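The bullet points above describe moving from linked-in C code to running llama.cpp's server as a child process. A minimal Go sketch of that shape, assuming a prebuilt server binary staged in a temporary directory; the binary name, flags, and port are illustrative, not the project's actual interface:

package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
)

func main() {
	// Cancel the context on interrupt so the runner stops when the app stops.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	// Stage and run the runner from a temporary directory ("tmp dir for running llm").
	dir, err := os.MkdirTemp("", "llm")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// No thread flag is passed: let llama.cpp decide the number of threads to use.
	cmd := exec.CommandContext(ctx, filepath.Join(dir, "server"),
		"--model", "model.bin", "--port", "8080")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Println("runner exited:", err)
	}
}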
Michael Yang
b25dd1795d
allow F16 to use metal
warning: F16 uses significantly more memory than a quantized model, so the
standard requirements don't apply.
2023-08-26 08:38:48 -07:00
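Rough arithmetic behind that warning, as a hedged illustration (the bytes-per-weight figures are approximate and not taken from the repository): F16 stores 2 bytes per weight, while q4_0 stores about 4.5 bits per weight, so a 7B model grows from roughly 3.7 GiB to about 13 GiB.

package main

import "fmt"

func main() {
	const params = 7e9                                    // illustrative 7B model
	fmt.Printf("q4_0: %.1f GiB\n", params*0.5625/(1<<30)) // ~4.5 bits per weight
	fmt.Printf("F16 : %.1f GiB\n", params*2.0/(1<<30))    // 2 bytes per weight
}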
Michael Yang
304f2b6c96
add 34b to mem check
2023-08-26 08:29:21 -07:00
Michael Yang
a894cc792d
model and file type as strings
2023-08-17 12:08:04 -07:00
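A minimal sketch of representing the file type as a string: a stringer over llama.cpp's ftype numbering. The type and constant names here are assumptions, not the repository's actual code.

package main

import "fmt"

type FileType uint32

const (
	FileTypeF32 FileType = iota
	FileTypeF16
	FileTypeQ4_0
	FileTypeQ4_1
)

func (t FileType) String() string {
	switch t {
	case FileTypeF32:
		return "F32"
	case FileTypeF16:
		return "F16"
	case FileTypeQ4_0:
		return "Q4_0"
	case FileTypeQ4_1:
		return "Q4_1"
	default:
		return fmt.Sprintf("unknown(%d)", uint32(t))
	}
}

func main() {
	fmt.Println(FileTypeQ4_0) // prints Q4_0 via the String method
}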
Michael Yang
e26085b921
close open files
2023-08-14 16:08:06 -07:00
Michael Yang
6de5d032e1
implement loading ggml lora adapters through the modelfile
2023-08-10 09:23:39 -07:00
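For reference, a LoRA adapter is applied next to the base model in a Modelfile, assuming the ADAPTER instruction; the model name and adapter path below are placeholders:

FROM llama2
ADAPTER ./ggml-adapter-model.bin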
Michael Yang
d791df75dd
check memory requirements before loading
2023-08-10 09:23:11 -07:00
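A rough sketch of a pre-load memory check in the spirit of the commit above: estimate the footprint from the weights file and refuse to load if the system lacks the RAM. The 25% overhead factor and the available-memory lookup are placeholder assumptions, not the repository's logic.

package main

import (
	"errors"
	"fmt"
	"os"
)

// availableMemory is a placeholder; a real implementation would query the OS
// (e.g. sysctl hw.memsize on macOS or /proc/meminfo on Linux).
func availableMemory() uint64 { return 16 << 30 }

func checkMemory(modelPath string) error {
	info, err := os.Stat(modelPath)
	if err != nil {
		return err
	}
	// Assume the runtime needs the weights plus ~25% for KV cache and buffers.
	required := uint64(info.Size()) + uint64(info.Size())/4
	if required > availableMemory() {
		return errors.New("model requires more memory than is available")
	}
	return nil
}

func main() {
	if err := checkMemory("model.bin"); err != nil {
		fmt.Println(err)
	}
}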
Michael Yang
020a3b3530
disable gpu for q5_0, q5_1, q8_0 quants
2023-08-10 09:23:11 -07:00
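A small sketch of that gate, assuming the quantization type is already known as a string; the function name is hypothetical:

package main

import "fmt"

func gpuSupported(fileType string) bool {
	switch fileType {
	case "Q5_0", "Q5_1", "Q8_0":
		// These quantizations were not usable on the GPU at the time,
		// so force CPU-only inference for them.
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(gpuSupported("Q5_0")) // false
	fmt.Println(gpuSupported("Q4_0")) // true
}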
Michael Yang
fccf8d179f
partial decode ggml bin for more info
2023-08-10 09:23:10 -07:00
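A sketch of partially decoding a ggml bin to learn about a model without loading its tensors: read the magic, version, and hyperparameters from the header. The field order below follows the old llama "ggjt" layout as an assumption; a robust reader would branch on each known magic value.

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("model.bin") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Header fields in the old llama "ggjt" order (an assumption here).
	var header struct {
		Magic, Version                                   uint32
		NVocab, NEmbd, NMult, NHead, NLayer, NRot, FType uint32
	}
	if err := binary.Read(f, binary.LittleEndian, &header); err != nil {
		panic(err)
	}

	fmt.Printf("magic=%#x version=%d layers=%d file type=%d\n",
		header.Magic, header.Version, header.NLayer, header.FType)
}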