
Import a model

This guide walks through importing a PyTorch, Safetensors or GGUF model from a HuggingFace repo to Ollama.

Supported models

Ollama supports a set of model architectures, with support for more coming soon:

  • Llama & Mistral
  • Falcon & RW
  • GPT-NeoX
  • BigCode

To view a model's architecture, check the config.json file in its HuggingFace repo. You should see an entry under architectures (e.g. LlamaForCausalLM).
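
For example, from inside a cloned repo (such as the one cloned in Step 1 below), the declared architecture can be printed with a quick check. This is only an illustrative one-liner:

python -c "import json; print(json.load(open('config.json'))['architectures'])"

For Mistral-7B-Instruct-v0.1 this should print ['MistralForCausalLM'].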

Importing

Step 1: Clone the HuggingFace repository

git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
cd Mistral-7B-Instruct-v0.1

Step 2: Convert and quantize (for PyTorch and Safetensors)

A Docker image with the tooling required to convert and quantize models is available.

First, install Docker.

Next, to convert and quantize your model, run:

docker run --rm -v .:/model ollama/quantize -q q4_0 /model

This will output two files into the directory:

  • f16.bin: the model converted to GGUF
  • q4_0.bin: the model quantized to 4 bits (we will use this file to create the Ollama model)
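
Note: some Docker versions do not accept a relative path for -v. If the command above fails to mount the directory, mounting the current directory by its absolute path should be equivalent:

docker run --rm -v "$(pwd)":/model ollama/quantize -q q4_0 /model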

Step 3: Write a Modelfile

Next, create a Modelfile for your model. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.

FROM ./q4_0.bin

(Optional) Many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the TEMPLATE instruction in the Modelfile:

FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"

Step 4: Create the Ollama model

Finally, create a model from your Modelfile:

ollama create example -f Modelfile

Next, test the model with ollama run:

ollama run example "What is your favourite condiment?"

Step 5: Publish your model (optional, early alpha)

Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:

  1. Create an account
  2. Run cat ~/.ollama/id_ed25519.pub to view your Ollama public key and copy it to the clipboard (one way to do this is shown below)
  3. Add your public key to your Ollama account
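
For example, on macOS the key can be piped straight to the clipboard with pbcopy (on Linux, xclip is a common equivalent, though it may need to be installed first):

cat ~/.ollama/id_ed25519.pub | pbcopy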

Next, copy your model to your username's namespace:

ollama cp example <your username>/example

Then push the model:

ollama push <your username>/example

After publishing, your model will be available at https://ollama.ai/<your username>/example.
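
For example, with a hypothetical account named jdoe, the full sequence would be:

ollama cp example jdoe/example
ollama push jdoe/example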

Quantization reference

The quantization options are as follows (from the highest to the lowest level of quantization). Note: some architectures, such as Falcon, do not support K quants.

  • q2_K
  • q3_K
  • q3_K_S
  • q3_K_M
  • q3_K_L
  • q4_0 (recommended)
  • q4_1
  • q4_K
  • q4_K_S
  • q4_K_M
  • q5_0
  • q5_1
  • q5_K
  • q5_K_S
  • q5_K_M
  • q6_K
  • q8_0
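
To use a different level from this list, pass it to the -q flag of the quantization tool from Step 2. For example (assuming the output file is named after the chosen level, as it is for q4_0):

docker run --rm -v .:/model ollama/quantize -q q5_K_M /model

The Modelfile's FROM line would then point at the resulting file instead of q4_0.bin.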

Manually converting & quantizing models

Prerequisites

Start by cloning the llama.cpp repo to your machine in another directory:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Next, install the Python dependencies:

pip install -r requirements.txt

Finally, build the quantize tool:

make quantize

Convert the model

Run the correct conversion script for your model architecture:

# LlamaForCausalLM or MistralForCausalLM
python convert.py <path to model directory>

# FalconForCausalLM
python convert-falcon-hf-to-gguf.py <path to model directory>

# GPTNeoXForCausalLM
python convert-gptneox-hf-to-gguf.py <path to model directory>

# GPTBigCodeForCausalLM
python convert-starcoder-hf-to-gguf.py <path to model directory>

Quantize the model

quantize <path to model dir>/ggml-model-f32.bin <path to model dir>/q4_0.bin q4_0
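
The resulting q4_0.bin can then be used just like the Docker output in Steps 3 and 4 above: reference it from a Modelfile's FROM line and run ollama create:

FROM <path to model dir>/q4_0.bin

ollama create example -f Modelfile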