update the import docs (#6104)
This commit is contained in:
parent
47fa0839b9
commit
ac80010db8
3 changed files with 142 additions and 46 deletions
BIN  docs/images/ollama-keys.png  (new file, 141 KiB; binary file not shown)
BIN  docs/images/signup.png  (new file, 80 KiB; binary file not shown)
188  docs/import.md

@@ -1,44 +1,129 @@
# Importing a model

## Table of Contents

  * [Importing a Safetensors adapter](#Importing-a-fine-tuned-adapter-from-Safetensors-weights)
  * [Importing a Safetensors model](#Importing-a-model-from-Safetensors-weights)
  * [Importing a GGUF file](#Importing-a-GGUF-based-model-or-adapter)
  * [Sharing models on ollama.com](#Sharing-your-model-on-ollama.com)
## Importing a fine tuned adapter from Safetensors weights

First, create a `Modelfile` with a `FROM` command pointing at the base model you used for fine tuning, and an `ADAPTER` command which points to the directory with your Safetensors adapter:

```dockerfile
FROM <base model name>
ADAPTER /path/to/safetensors/adapter/directory
```
Make sure that you use the same base model in the `FROM` command as you used to create the adapter, otherwise you will get erratic results. Most frameworks use different quantization methods, so it's best to use non-quantized (i.e. non-QLoRA) adapters. If your adapter is in the same directory as your `Modelfile`, use `ADAPTER .` to specify the adapter path, as in the sketch below.
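For example, a minimal `Modelfile` kept inside the adapter directory might look like the following sketch; the base model name `llama3.1` is an assumption, and should be whichever base model the adapter was actually fine tuned from:

```dockerfile
# Hypothetical sketch: this Modelfile sits inside the Safetensors adapter directory,
# so ADAPTER . refers to the adapter files next to it.
FROM llama3.1
ADAPTER .
```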
Now run `ollama create` from the directory where the `Modelfile` was created:

```bash
ollama create my-model
```

Lastly, test the model:

```bash
ollama run my-model
```

Ollama supports importing adapters based on several different model architectures including:

  * Llama (including Llama 2, Llama 3, and Llama 3.1);
  * Mistral (including Mistral 1, Mistral 2, and Mixtral); and
  * Gemma (including Gemma 1 and Gemma 2)
You can create the adapter using a fine tuning framework or tool which can output adapters in the Safetensors format, such as:

  * Hugging Face [fine tuning framework](https://huggingface.co/docs/transformers/en/training)
  * [Unsloth](https://github.com/unslothai/unsloth)
  * [MLX](https://github.com/ml-explore/mlx)
## Importing a model from Safetensors weights

First, create a `Modelfile` with a `FROM` command which points to the directory containing your Safetensors weights:

```dockerfile
FROM /path/to/safetensors/directory
```

If you create the `Modelfile` in the same directory as the weights, you can use the command `FROM .`, as in the sketch below.
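For example, a minimal sketch assuming a hypothetical layout where the `Modelfile` is saved inside the Safetensors model directory itself:

```dockerfile
# Modelfile stored inside /path/to/safetensors/directory
FROM .
```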
Now run the `ollama create` command from the directory where you created the `Modelfile`:

```shell
ollama create my-model
```

Lastly, test the model:

```shell
ollama run my-model
```

Ollama supports importing models for several different architectures including:

  * Llama (including Llama 2, Llama 3, and Llama 3.1);
  * Mistral (including Mistral 1, Mistral 2, and Mixtral);
  * Gemma (including Gemma 1 and Gemma 2); and
  * Phi3

This includes importing foundation models as well as any fine tuned models which have been _fused_ with a foundation model.
## Importing a GGUF based model or adapter

If you have a GGUF based model or adapter, it is possible to import it into Ollama. You can obtain a GGUF model or adapter by:

  * converting a Safetensors model with the `convert_hf_to_gguf.py` script from Llama.cpp (see the sketch below);
  * converting a Safetensors adapter with the `convert_lora_to_gguf.py` script from Llama.cpp; or
  * downloading a model or adapter from a place such as HuggingFace
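As a rough sketch of the conversion step (this is not part of the Ollama docs themselves), the llama.cpp scripts can be run along these lines; the paths are placeholders, and the `--outfile`, `--outtype`, and `--base` flags may vary between llama.cpp versions, so check each script's `--help`:

```shell
# Assumes a local llama.cpp checkout with its Python requirements installed.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt

# Convert a Safetensors model directory into a single GGUF file.
python convert_hf_to_gguf.py /path/to/safetensors/model --outfile model.gguf --outtype f16

# Convert a Safetensors (LoRA) adapter; the base model is needed to resolve tensor names.
python convert_lora_to_gguf.py /path/to/safetensors/adapter \
  --base /path/to/safetensors/model --outfile adapter.gguf
```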
To import a GGUF model, create a `Modelfile` containing:

```dockerfile
FROM /path/to/file.gguf
```

For a GGUF adapter, create the `Modelfile` with:

```dockerfile
FROM <model name>
ADAPTER /path/to/file.gguf
```
When importing a GGUF adapter, it's important to use the same base model that the adapter was created with. You can use any of the following as the base model (see the sketch after this list):

  * a model from Ollama
  * a GGUF file
  * a Safetensors based model
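For illustration only, here is a hypothetical `Modelfile`; every name and path is a placeholder:

```dockerfile
# Use exactly one FROM line, matching the base model the adapter was trained on:
#   FROM llama3.1                        # a model already available in Ollama
#   FROM /path/to/base-model.gguf        # a GGUF base model on disk
#   FROM /path/to/safetensors/directory  # a Safetensors base model
FROM llama3.1
ADAPTER /path/to/adapter.gguf
```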
Once you have created your `Modelfile`, use the `ollama create` command to build the model.

```shell
ollama create my-model
```
## Quantizing a Model

Quantizing a model allows you to run models faster and with less memory consumption but at reduced accuracy. This allows you to run a model on more modest hardware.

Ollama can quantize FP16 and FP32 based models into different quantization levels using the `-q/--quantize` flag with the `ollama create` command.

First, create a `Modelfile` with the FP16 or FP32 based model you wish to quantize.

```dockerfile
FROM /path/to/my/gemma/f16/model
```

Use `ollama create` to then create the quantized model.

```shell
$ ollama create --quantize q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
```
@@ -49,42 +134,53 @@ success
### Supported Quantizations

- `q4_0`
- `q4_1`
- `q5_0`
- `q5_1`
- `q8_0`
#### K-means Quantizations

- `q3_K_S`
- `q3_K_M`
- `q3_K_L`
- `q4_K_S`
- `q4_K_M`
- `q5_K_S`
- `q5_K_M`
- `q6_K`
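For example, to build an 8-bit variant from the same `Modelfile` (the model name here is just a placeholder):

```shell
ollama create --quantize q8_0 mymodel-q8
```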
## Sharing your model on ollama.com

You can share any model you have created by pushing it to [ollama.com](https://ollama.com) so that other users can try it out.

First, use your browser to go to the [Ollama Sign-Up](https://ollama.com/signup) page. If you already have an account, you can skip this step.

![Sign-Up](images/signup.png)

The `Username` field will be used as part of your model's name (e.g. `jmorganca/mymodel`), so make sure you are comfortable with the username that you have selected.

Now that you have created an account and are signed in, go to the [Ollama Keys Settings](https://ollama.com/settings/keys) page.

Follow the directions on the page to determine where your Ollama Public Key is located.

![Ollama Key](images/ollama-keys.png)

Click on the `Add Ollama Public Key` button, and copy and paste the contents of your Ollama Public Key into the text field.

To push a model to [ollama.com](https://ollama.com), first make sure that it is named correctly with your username. You may have to use the `ollama cp` command to copy your model to give it the correct name. Once you're happy with your model's name, use the `ollama push` command to push it to [ollama.com](https://ollama.com).

```shell
ollama cp mymodel myuser/mymodel
ollama push myuser/mymodel
```

Once your model has been pushed, other users can pull and run it by using the command:

```shell
ollama run myuser/mymodel
```