Jeffrey Morgan
213ffdb548
macos amd64 compatibility fixes
2023-09-05 21:33:31 -04:00
Bruce MacDonald
d18282bfda
metal: add missing barriers for mul-mat (#469)
2023-09-05 19:37:13 -04:00
Michael Yang
2bc06565c7
fix empty response
2023-09-05 15:03:24 -07:00
Michael Yang
7b5aefb427
Merge pull request #462 from jmorganca/mxyng/rm-marshal-prompt
remove marshalPrompt which is no longer needed
2023-09-05 11:48:41 -07:00
Jeffrey Morgan
7fa6e51686
generate binary dependencies based on GOARCH on macos (#459)
2023-09-05 12:53:57 -04:00
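The go generate tool exports GOARCH (and GOOS) to the commands it runs, so a single script can build the right macOS artifacts for each architecture. A minimal sketch, assuming a hypothetical gen_darwin.sh; this is not the repo's actual script:

```go
//go:build darwin

// Hypothetical sketch: `go generate ./...` sets $GOARCH for the command it
// invokes, so gen_darwin.sh (an invented name) can compile the llama.cpp
// binary dependencies for the current architecture only.
package llm

//go:generate sh ./gen_darwin.sh
```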
Michael Yang
59a705525c
fix not forwarding last token
2023-09-03 17:46:50 -04:00
Michael Yang
5d3f314b0b
remove marshalPrompt which is no longer needed
2023-09-03 17:01:05 -04:00
Bruce MacDonald
f964aea9a2
remove test not applicable to subprocess
2023-08-30 16:36:11 -04:00
Bruce MacDonald
42998d797d
subprocess llama.cpp server (#401)
* remove c code
* pack llama.cpp
* use request context for llama_cpp
* let llama_cpp decide the number of threads to use
* stop llama runner when app stops
* remove sample count and duration metrics
* use go generate to get libraries
* tmp dir for running llm
2023-08-30 16:35:03 -04:00
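A compressed Go sketch of the pattern those bullets describe: the llama.cpp server runs as a subprocess out of a temporary directory, and exec.CommandContext ties its lifetime to the request context so the runner stops when the request or app does. The binary name, flags, and port are assumptions:

```go
// Sketch only: extraction of the packed llama.cpp server binary into the
// temporary directory is elided; paths and flags are invented.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

func runServer(ctx context.Context, model string) error {
	dir, err := os.MkdirTemp("", "llama") // tmp dir for running the llm
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	// exec.CommandContext kills the server process when ctx is canceled,
	// e.g. when the request ends or the app shuts down.
	cmd := exec.CommandContext(ctx, filepath.Join(dir, "server"),
		"--model", model, "--port", "8080")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if err := runServer(context.Background(), "model.bin"); err != nil {
		log.Fatal(err)
	}
}
```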
Quinn Slack
f4432e1dba
treat stop as stop sequences, not exact tokens (#442)
The `stop` option to the generate API is a list of sequences that should cause generation to stop. Although these are commonly called "stop tokens", they do not necessarily correspond to LLM tokens (per the LLM's tokenizer). For example, if the caller sends a generate request with `"stop":["\n"]`, then generation should stop on any token containing `\n` (and trim `\n` from the output), not just if the token exactly matches `\n`. If `stop` were interpreted strictly as LLM tokens, then it would require callers of the generate API to know the LLM's tokenizer and enumerate many tokens in the `stop` list.
Fixes https://github.com/jmorganca/ollama/issues/295.
2023-08-30 11:53:42 -04:00
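A minimal Go sketch of this behavior: scan the accumulated output for any stop sequence and truncate at the first match, rather than comparing individual tokens. Function names are illustrative:

```go
// Sequence-based stop handling: search the generated text for a stop string
// instead of comparing single tokens against it.
package main

import (
	"fmt"
	"strings"
)

// checkStop reports whether output contains any stop sequence and, if so,
// returns the output truncated just before it (trimming the sequence).
func checkStop(output string, stop []string) (string, bool) {
	for _, s := range stop {
		if i := strings.Index(output, s); i >= 0 {
			return output[:i], true
		}
	}
	return output, false
}

func main() {
	out, stopped := checkStop("Hello world\nmore text", []string{"\n"})
	fmt.Printf("%q stopped=%v\n", out, stopped) // "Hello world" stopped=true
}
```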
Michael Yang
7df342a6ea
Merge pull request #421 from jmorganca/mxyng/f16-metal
allow F16 to use metal
2023-08-29 06:32:59 -07:00
Michael Yang
e82fcf30c6
Merge pull request #420 from jmorganca/mxyng/34b-mem-check
add 34b to mem check
2023-08-26 14:15:52 -07:00
Michael Yang
b25dd1795d
allow F16 to use metal
warning: F16 uses significantly more memory than a quantized model, so the
standard requirements don't apply.
2023-08-26 08:38:48 -07:00
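A hypothetical sketch of the policy this commit describes: F16 is allowed on Metal, but the standard memory requirements (calibrated for quantized models) are bypassed in favor of a direct size check. The function names and headroom factor are assumptions:

```go
package llm

// allowMetal sketches the F16 exception: quantized-model thresholds don't
// apply to F16, so fall back to comparing model size to system memory.
func allowMetal(fileType string, modelGB, sysGB float64) bool {
	if fileType == "F16" {
		return modelGB < sysGB // standard requirements don't apply
	}
	return modelGB*1.2 < sysGB // assumed headroom for quantized models
}
```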
Michael Yang
304f2b6c96
add 34b to mem check
2023-08-26 08:29:21 -07:00
Jeffrey Morgan
177b69a211
add missing entries for 34B
2023-08-25 18:35:35 -07:00
Michael Yang
7a378f8b66
patch llama.cpp for 34B
2023-08-25 10:06:55 -07:00
Michael Yang
b1cececb8e
add 34b model type
2023-08-24 10:35:44 -07:00
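One common way to infer the model size class is from GGML hyperparameters; a sketch keyed on layer count, following the published LLaMA/Code Llama configurations (34B, the Code Llama size added here, uses 48 layers). The mapping should be treated as illustrative:

```go
package llm

// modelType maps a layer count from the GGML header to a parameter-count
// class, per the published LLaMA/Code Llama configurations.
func modelType(nLayer uint32) string {
	switch nLayer {
	case 32:
		return "7B"
	case 40:
		return "13B"
	case 48:
		return "34B"
	case 60:
		return "30B"
	case 80:
		return "65B"
	default:
		return "unknown"
	}
}
```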
Michael Yang
5ca05c2e88
fix ModelType()
2023-08-18 11:23:38 -07:00
Michael Yang
a894cc792d
model and file type as strings
2023-08-17 12:08:04 -07:00
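A sketch of exposing a numeric GGML file type as a string via a Stringer; the enum values follow the classic GGML ftype numbering but are illustrative here:

```go
package llm

import "fmt"

type fileType uint32

// Illustrative subset of the GGML ftype enum.
const (
	fileTypeF32 fileType = iota
	fileTypeF16
	fileTypeQ4_0
	fileTypeQ4_1
)

func (ft fileType) String() string {
	switch ft {
	case fileTypeF32:
		return "F32"
	case fileTypeF16:
		return "F16"
	case fileTypeQ4_0:
		return "Q4_0"
	case fileTypeQ4_1:
		return "Q4_1"
	default:
		return fmt.Sprintf("unknown(%d)", uint32(ft))
	}
}
```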
Michael Yang
4dcf5c3e0b
Merge pull request #349 from jmorganca/close-files
close open files
2023-08-14 16:15:58 -07:00
Michael Yang
e26085b921
close open files
2023-08-14 16:08:06 -07:00
Michael Yang
f7b613332c
update llama.cpp
2023-08-14 15:47:00 -07:00
Bruce MacDonald
4b2d366c37
Update llama.go
2023-08-14 12:55:50 -03:00
Bruce MacDonald
56fd4e4ef2
log embedding eval timing
2023-08-14 12:51:31 -03:00
Jeffrey Morgan
22885aeaee
update llama.cpp to f64d44a
2023-08-12 22:47:15 -04:00
Michael Yang
6ed991c8e2
ggml: fix off by one error
remove unused Unknown FileType
2023-08-11 10:45:22 -07:00
Michael Yang
6de5d032e1
implement loading ggml lora adapters through the modelfile
2023-08-10 09:23:39 -07:00
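ADAPTER is the Modelfile instruction this feature adds. A minimal Go sketch of picking the adapter path out of a Modelfile; the real parser is more involved:

```go
// Sketch: scan Modelfile lines for an ADAPTER instruction and return the
// ggml lora adapter path it names.
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func adapterPath(modelfile string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(modelfile))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && strings.EqualFold(fields[0], "ADAPTER") {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	mf := "FROM llama2\nADAPTER ./lora-adapter.bin\n"
	if p, ok := adapterPath(mf); ok {
		fmt.Println("lora adapter:", p) // lora adapter: ./lora-adapter.bin
	}
}
```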
Michael Yang
d791df75dd
check memory requirements before loading
2023-08-10 09:23:11 -07:00
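A rough sketch of a pre-load check: compare the model file size, plus headroom for the KV cache and buffers, against system memory. The 1.2x headroom factor is an assumption, not the commit's exact heuristic:

```go
package main

import (
	"fmt"
	"os"
)

// checkMemory refuses a model whose estimated footprint exceeds system memory.
func checkMemory(modelPath string, sysMem uint64) error {
	fi, err := os.Stat(modelPath)
	if err != nil {
		return err
	}
	required := uint64(float64(fi.Size()) * 1.2) // assumed headroom
	if required > sysMem {
		return fmt.Errorf("model requires %d bytes, only %d available", required, sysMem)
	}
	return nil
}

func main() {
	// e.g. 16 GiB of system memory
	if err := checkMemory("model.bin", 16<<30); err != nil {
		fmt.Println("refusing to load:", err)
	}
}
```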
Michael Yang
020a3b3530
disable gpu for q5_0, q5_1, q8_0 quants
2023-08-10 09:23:11 -07:00
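A sketch of the gate this implies, assuming the motivation was missing or unreliable GPU kernels for these quantization formats at the time; the names are invented:

```go
package llm

// gpuLayers forces CPU-only inference for quant formats excluded from GPU
// offload (assumed here to lack working kernels at the time).
func gpuLayers(fileType string, requested int) int {
	switch fileType {
	case "Q5_0", "Q5_1", "Q8_0":
		return 0 // fall back to CPU for these quants
	}
	return requested
}
```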
Michael Yang
fccf8d179f
partial decode ggml bin for more info
2023-08-10 09:23:10 -07:00
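A sketch of partial decoding: read only the magic, version, and hyperparameters from the file header to learn about the model without loading tensor data. The field layout follows the classic GGJT llama format; treat it as illustrative rather than a complete decoder:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Hyperparameters in GGJT field order; older unversioned "ggml" files omit
// the version word, which this sketch does not handle.
type hyperparameters struct {
	NVocab, NEmbd, NMult, NHead, NLayer, NRot, FileType uint32
}

func main() {
	f, err := os.Open("model.bin")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	var magic, version uint32
	binary.Read(f, binary.LittleEndian, &magic) // e.g. 0x67676a74 ("ggjt")
	binary.Read(f, binary.LittleEndian, &version)

	var hp hyperparameters
	if err := binary.Read(f, binary.LittleEndian, &hp); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("magic=%#x version=%d layers=%d filetype=%d\n",
		magic, version, hp.NLayer, hp.FileType)
}
```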