The Digest type in its current form is awkward to work with and presents
challenges with regard to how it serializes via String using the '-'
prefix.
We currently only use this in ollama.com, so we'll move our specific
needs around digest parsing and validation there.
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off. This also implements a tri-state for
use_mmap so we can detect the difference between a user provided
value of true/false, or unspecified.
We update the PATH on windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
* docs: add missing instruction for powershell build
The powershell script for building Ollama on Windows now requires the `ThreadJob` module. Add this to the instructions and dependency list.
* Update development.md
This implements the release logic we want via gh cli
to support updating releases with rc tags in place and retain
release notes and other community reactions.
While models are loading, the VRAM metrics are dynamic, so try
to load on a GPU that doesn't have a model actively loading, or wait
to avoid races that lead to OOMs
Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one. This exposes that
tunable behavior.
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block