- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in the case automatic pull fails
Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
previous layer creation was not ideal because:
1. it required reading the input file multiple times, once to calculate
the sha256 checksum, another to write it to disk, and potentially one
more to decode the underlying gguf
2. used io.ReadSeeker which is prone to user error. if the file isn't
reset correctly or in the right place, it could end up reading an
empty file
there are also some brittleness when reading existing layers else
writing the inherited layers will error reading an already closed file
this commit aims to fix these issues by restructuring layer creation.
1. it will now write the layer to a temporary file as well as the hash
function and move it to the final location on Commit
2. layers are read once once when copied to the destination. exception
is raw model files which still requires a second read to decode the
model metadata
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
- context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
* allow for a configurable ollama models directory
- set OLLAMA_MODELS in the environment that ollama is running in to change where model files are stored
- update docs
Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
Co-Authored-By: Jay Nakrani <dhananjaynakrani@gmail.com>
Co-Authored-By: Akhil Acharya <akhilcacharya@gmail.com>
Co-Authored-By: Sasha Devol <sasha.devol@protonmail.com>
- only reload the running llm if the model has changed, or the options for loading the running model have changed
- rename loaded llm to runner to differentiate from loaded model image
- remove logic which keeps the first system prompt in the generation context
* remove tmp directories created by previous servers
* clean up on server stop
* Update routes.go
* Update server/routes.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* create top-level temp ollama dir
* check file exists before creating
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
parameters are not inherited because they are processed differently from
other layer. fix this by explicitly merging the inherited params into
the new params. parameter values defined in the new modelfile will
override those defined in the inherited modelfile. array lists are
replaced instead of appended
The `html/template` package automatically HTML-escapes interpolated strings in templates. This behavior is undesirable because it causes prompts like `<h1>hello` to be escaped to `<h1>hello` before being passed to the LLM.
The included test case passes, but before the code change, it failed:
```
--- FAIL: TestModelPrompt
images_test.go:21: got "a<h1>b", want "a<h1>b"
```
* remove c code
* pack llama.cpp
* use request context for llama_cpp
* let llama_cpp decide the number of threads to use
* stop llama runner when app stops
* remove sample count and duration metrics
* use go generate to get libraries
* tmp dir for running llm
The `stop` option to the generate API is a list of sequences that should cause generation to stop. Although these are commonly called "stop tokens", they do not necessarily correspond to LLM tokens (per the LLM's tokenizer). For example, if the caller sends a generate request with `"stop":["\n"]`, then generation should stop on any token containing `\n` (and trim `\n` from the output), not just if the token exactly matches `\n`. If `stop` were interpreted strictly as LLM tokens, then it would require callers of the generate API to know the LLM's tokenizer and enumerate many tokens in the `stop` list.
Fixes https://github.com/jmorganca/ollama/issues/295.
The token printed for authorized requests has a lifetime of 1h. If an
upload exceeds 1h, a chunk push will fail since the token is created on
a "start upload" request.
This replaces the Pipe with SectionReader which is simpler and
implements Seek, a requirement for makeRequestWithRetry. This is
slightly worse than using a Pipe since the progress update is directly
tied to the chunk size instead of controlled separately.