# Install Docker Server
**Note #1:** This was tested with Docker running on Linux. If you can get it working on Windows or MacOS, please update this README.md with a PR!
**Note #2:** NVidia GPU CuBLAS support requires an NVidia GPU with sufficient VRAM (approximately as much as the quantized model size in the table below) and Docker NVidia support (see the NVIDIA [container-toolkit install guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)).
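You can verify that Docker can see your GPU before building anything. A minimal check, assuming the NVIDIA Container Toolkit is already installed (the CUDA image tag is just an example; any CUDA base image works):

```sh
# Should print the same GPU table as running nvidia-smi directly on the host
docker run --rm --gpus=all nvidia/cuda:12.1.1-devel-ubuntu22.04 nvidia-smi
```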
# Simple Dockerfiles for building the llama-cpp-python server with external model bin files
## ./openblas_simple/Dockerfile
A simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image:
```sh
cd ./openblas_simple
docker build -t openblas_simple .
docker run -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple
```
where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.
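For example, a hypothetical invocation for a model file `ggml-model-q5_1.bin` kept in `/home/user/models` on the host (both names are placeholders; substitute your own paths):

```sh
# /home/user/models and ggml-model-q5_1.bin are example values only
docker run -e USE_MLOCK=0 \
  -e MODEL=/var/model/ggml-model-q5_1.bin \
  -v /home/user/models:/var/model \
  -t openblas_simple
```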
## ./cuda_simple/Dockerfile
A simple Dockerfile for CUDA-accelerated CuBLAS, where the model is located outside the Docker image:
```sh
cd ./cuda_simple
docker build -t cuda_simple .
docker run -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t cuda_simple
```
where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.
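Note that Docker does not expose the GPU to a container by default. A sketch of a run that does, assuming the NVIDIA Container Toolkit is installed:

```sh
# --gpus=all makes the host GPUs visible inside the container
docker run --gpus=all -e USE_MLOCK=0 \
  -e MODEL=/var/model/<model-path> \
  -v <model-root-path>:/var/model \
  -t cuda_simple
```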
"Bot-in-a-box" - a method to build a Docker image by choosing a model to be downloaded and loading into a Docker image
cd ./auto_docker
:hug_model.py
- a Python utility for interactively choosing and downloading the latest5_1
quantized models from huggingface.co/TheBlokeDockerfile
- a single OpenBLAS and CuBLAS combined Dockerfile that automatically installs a previously downloaded modelmodel.bin
## Download a Llama Model from Hugging Face
- To download an MIT-licensed Llama model you can run:
```sh
python3 ./hug_model.py -a vihangd -s open_llama_7b_700bt_ggml -f ggml-model-q5_1.bin
```
- To select and install a restricted-license Llama model run:
```sh
python3 ./hug_model.py -a TheBloke -t llama
```
- You should now have a model in the current directory and `model.bin` symlinked to it for the subsequent Docker build and copy step, e.g.
```sh
docker $ ls -lh *.bin
-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>q5_1.bin
lrwxrwxrwx 1 user user   24 May 23 18:30 model.bin -> <downloaded-model-file>q5_1.bin
```
**Note #1:** Make sure you have enough disk space to download the model. As the model is then copied into the image, you will need at least **TWICE** as much free disk space as the size of the model:
| Model | Quantized size |
|------:|---------------:|
|    3B |           3 GB |
|    7B |           5 GB |
|   13B |          10 GB |
|   33B |          25 GB |
|   65B |          50 GB |
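A quick way to sanity-check this before building, assuming a Linux host (the `.bin` glob matches the model downloaded above):

```sh
# Compare free space on this filesystem against the model size
df -h .
ls -lh *.bin
```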
**Note #2:** If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`
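As an illustration, a minimal sketch of a customised `start_server.sh`. The flag values below are assumptions about the installed `llama_cpp.server` version, so check `python3 -m llama_cpp.server --help` for the authoritative list:

```sh
#!/bin/sh
# Example tweaks only: larger context window, explicit bind address and port.
# Verify these flags against your llama-cpp-python version before use.
python3 -B -m llama_cpp.server --model /app/model.bin --n_ctx 2048 --host 0.0.0.0 --port 8000
```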
## Use OpenBLAS
Use this if you don't have an NVidia GPU. Defaults to the `python:3-slim-bullseye` Docker base image and OpenBLAS:
### Build:
```sh
docker build -t openblas .
```
### Run:
```sh
docker run --cap-add SYS_RESOURCE -t openblas
```
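To reach the server from the host you will usually also want to publish its port. A sketch, assuming the server listens on port 8000 inside the container (the default for `llama_cpp.server`):

```sh
# Publish container port 8000 on the host, then exercise the
# OpenAI-compatible completions endpoint from another terminal
docker run --cap-add SYS_RESOURCE -p 8000:8000 -t openblas

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 16}'
```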
## Use CuBLAS
### Build:
```sh
docker build --build-arg IMAGE=nvidia/cuda:12.1.1-devel-ubuntu22.04 -t cublas .
```
### Run:
```sh
docker run --cap-add SYS_RESOURCE -t cublas
```
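As with `cuda_simple` above, the GPU typically has to be exposed to the container explicitly. A sketch, assuming the NVIDIA Container Toolkit is installed:

```sh
# Expose the host GPUs so the CuBLAS-enabled server can actually use them
docker run --cap-add SYS_RESOURCE --gpus=all -t cublas
```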