llama
This package integrates the llama.cpp library as a Go package and makes it easy to build it with tags for different CPU and GPU processors.
Supported:
- CPU
- avx, avx2
- macOS Metal
- Windows CUDA
- Windows ROCm
- Linux CUDA
- Linux ROCm
- Llava
Extra build steps are required for CUDA and ROCm on Windows since nvcc and hipcc both require using msvc as the host compiler. For these, shared libraries are created:
- ggml_cuda.dll on Windows or ggml_cuda.so on Linux
- ggml_hipblas.dll on Windows or ggml_hipblas.so on Linux
Note: it's important that memory is allocated and freed by the same compiler (e.g. entirely by code compiled with msvc or mingw). Issues from this should be rare, but there are some places where pointers are returned by the CUDA or HIP runtimes and freed elsewhere, causing a crash. In a future change the same runtime should be used in both cases to avoid crashes.
Building
go build .
AVX
go build -tags avx .
AVX2
# go doesn't recognize `-mfma` as a valid compiler flag
# see https://github.com/golang/go/issues/17895
go env -w "CGO_CFLAGS_ALLOW=-mfma|-mf16c"
go env -w "CGO_CXXFLAGS_ALLOW=-mfma|-mf16c"
go build -tags=avx,avx2 .
Linux
CUDA
Install the CUDA toolkit v11.3.1:
make ggml_cuda.so
go build -tags avx,cuda .
ROCm
Install ROCm.
make ggml_hipblas.so
go build -tags avx,rocm .
Windows
Download w64devkit for a simple MinGW development environment.
CUDA
Install the CUDA toolkit v11.3.1 then build the cuda code:
make ggml_cuda.dll
go build -tags avx,cuda .
ROCm
Install ROCm.
make ggml_hipblas.dll
go build -tags avx,rocm .
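The tag combinations and shared-library names used in the build commands above can be collected in one place. The following is a minimal sketch, not part of the repository: the function names llama_build_cmd and llama_shared_lib are hypothetical, and the functions only print what to run. For avx2, the sketch passes the allow-lists as per-invocation environment variables instead of persisting them with go env -w; this assumes the Go toolchain reads CGO_CFLAGS_ALLOW and CGO_CXXFLAGS_ALLOW from the environment.

```shell
# Hypothetical helper: print (do not run) the go build command for a backend.
# The tag names are the ones used in this README.
llama_build_cmd() {
  case "$1" in
    cpu)  echo "go build ." ;;
    avx)  echo "go build -tags avx ." ;;
    avx2) echo "CGO_CFLAGS_ALLOW='-mfma|-mf16c' CGO_CXXFLAGS_ALLOW='-mfma|-mf16c' go build -tags=avx,avx2 ." ;;
    cuda) echo "go build -tags avx,cuda ." ;;
    rocm) echo "go build -tags avx,rocm ." ;;
    *)    echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the make target (shared library) that must be
# built before the cuda or rocm go build. $1 = backend, $2 = linux|windows.
llama_shared_lib() {
  case "$2" in
    linux)   _ext=so ;;
    windows) _ext=dll ;;
    *)       echo "unknown os: $2" >&2; return 1 ;;
  esac
  case "$1" in
    cuda) echo "ggml_cuda.$_ext" ;;
    rocm) echo "ggml_hipblas.$_ext" ;;
    *)    echo "no shared library for backend: $1" >&2; return 1 ;;
  esac
}
```

For example, on Linux with CUDA, make "$(llama_shared_lib cuda linux)" builds ggml_cuda.so, and llama_build_cmd cuda prints the matching go build line.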
Building runners
# build all runners for this platform
make -j
Vendoring
Ollama currently vendors llama.cpp and ggml. While we generally strive to contribute changes back upstream to avoid drift, we carry a small set of patches which are applied to the tracking commit. A set of make targets is available to help developers update to a newer tracking commit or work on changes.
If you update the vendoring code, start by running the following command to establish the tracking llama.cpp repo in the ./vendor/ directory.
make apply-patches
Updating Base Commit
Pin to new base commit
To update to a newer base commit, select the upstream git tag or commit and update llama/vendoring.env.
Applying patches
When updating to a newer base commit, the existing patches may not apply cleanly and require manual merge resolution.
Start by applying the patches. If any of the patches have conflicts, git am will stop at the first failure.
make apply-patches
If you see an error message about a conflict, go into the ./vendor/ directory and resolve the conflict in the failing patch commit using your preferred tool. Save the file(s) and continue the patch series with git am --continue. If any additional patches fail, follow the same pattern until the full patch series is applied. Once finished, run the create-patches and sync targets to ensure everything is updated.
make create-patches sync
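The decision described above (keep resolving conflicts, or finish with create-patches and sync) can be sketched as a small helper. This is an illustration under stated assumptions, not a target provided by the Makefile: the function name next_patch_step is hypothetical, and it assumes the tracking repo is an ordinary clone whose metadata lives in ./vendor/.git, where git keeps a rebase-apply directory while a git am run is paused.

```shell
# Hypothetical helper: suggest the next step of the patch workflow.
# $1 = path to the tracking repo (defaults to ./vendor).
# Assumption: "$1/.git" is a directory, and a paused `git am` leaves
# "$1/.git/rebase-apply" behind until the series finishes or is aborted.
next_patch_step() {
  _repo="${1:-./vendor}"
  if [ -d "$_repo/.git/rebase-apply" ]; then
    echo "resolve conflicts in $_repo, then stage them and run: git -C $_repo am --continue"
  else
    echo "make create-patches sync"
  fi
}
```

Running next_patch_step after each failed make apply-patches or git am --continue tells you whether the series is still paused; it only prints a suggestion and never modifies the repo.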
Build and test Ollama, and make any necessary changes to the Go code based on the new base commit. Submit your PR to the Ollama repo.
Generating Patches
When working on new fixes or features that impact vendored code, use the following model. First get a clean tracking repo with all current patches applied:
make apply-patches
Now edit the upstream native code in the ./vendor/ directory. You do not need to commit every change in order to build; a dirty working tree in the tracking repo is OK while developing. Simply save in your editor, and run the following to refresh the vendored code with your changes, build the backend(s), and build ollama:
make sync
make -j 8
go build .
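If you repeat these three commands often, a thin wrapper can stop at the first failing step so a broken sync is not followed by a stale build. The sketch below is only an illustration: the function name iterate_vendor and the DRY_RUN variable are hypothetical and are not provided by the repository. It relies on sh/bash word splitting of the unquoted step string.

```shell
# Hypothetical wrapper around the three commands above.
# With DRY_RUN=1 it only prints the steps; otherwise it runs them in order
# and returns non-zero at the first step that fails.
iterate_vendor() {
  for _step in "make sync" "make -j 8" "go build ."; do
    echo "+ $_step"
    if [ "${DRY_RUN:-0}" != "1" ]; then
      # Intentionally unquoted so the string splits into command and arguments.
      $_step || return 1
    fi
  done
}
```

Use DRY_RUN=1 iterate_vendor to preview the steps, and plain iterate_vendor to run them.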
Important
Do NOT run apply-patches while you're iterating, as that will reset the tracking repo. It will detect a dirty tree and abort, but if your tree is clean and you accidentally run this target, use git reflog to recover your commit(s).
Iterate until you're ready to submit PRs. Once your code is ready, commit a change in the ./vendor/ directory, then generate the patches for ollama with:
make create-patches
Important
Once you have completed this step, it is safe to run apply-patches since your change is preserved in the patches.
In your ./vendor/ directory, create a branch and cherry-pick the new commit to that branch, then submit a PR upstream to llama.cpp.
Commit the changes in the ollama repo and submit a PR to Ollama, which will include the vendored code update with your change, along with the patches.
After your PR upstream is merged, follow the Updating Base Commit instructions above, but first remove your patch before running apply-patches, since the new base commit already contains your change.