
FAQ

How can I upgrade Ollama?

Ollama on macOS and Windows will automatically download updates. Click the taskbar (Windows) or menu bar (macOS) icon and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version manually.

On Linux, re-run the install script:

curl -fsSL https://ollama.com/install.sh | sh

How can I view the logs?

Review the Troubleshooting docs for more about using logs.
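For example, on Linux where Ollama runs as a systemd service, the server log can typically be viewed with journalctl; see the Troubleshooting docs for log locations on other platforms.

journalctl -u ollama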

How can I specify the context window size?

By default, Ollama uses a context window size of 2048 tokens.

To change this when using ollama run, use /set parameter:

/set parameter num_ctx 4096

When using the API, specify the num_ctx parameter:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'

How do I configure Ollama server?

Ollama server can be configured with environment variables.

Setting environment variables on Mac

If Ollama is run as a macOS application, environment variables should be set using launchctl:

  1. For each environment variable, call launchctl setenv.

    launchctl setenv OLLAMA_HOST "0.0.0.0"
    
  2. Restart Ollama application.

Setting environment variables on Linux

If Ollama is run as a systemd service, environment variables should be set using systemctl:

  1. Edit the systemd service by calling systemctl edit ollama.service. This will open an editor.

  2. For each environment variable, add an Environment line under the [Service] section:

    [Service]
    Environment="OLLAMA_HOST=0.0.0.0"
    
  3. Save and exit.

  4. Reload systemd and restart Ollama:

    systemctl daemon-reload
    systemctl restart ollama
    

Setting environment variables on Windows

On Windows, Ollama inherits your user and system environment variables.

  1. First, quit Ollama by clicking on it in the taskbar

  2. Edit system environment variables from the Control Panel

  3. Edit or create new variable(s) for your user account for OLLAMA_HOST, OLLAMA_MODELS, etc.

  4. Click OK/Apply to save

  5. Run ollama from a new terminal window
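As an alternative to the Control Panel dialog, a user-level variable can also be set from a Command Prompt or PowerShell with setx (a sketch; the change only applies to terminal windows opened afterwards):

setx OLLAMA_HOST "0.0.0.0"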

How can I expose Ollama on my network?

Ollama binds to 127.0.0.1 on port 11434 by default. Change the bind address with the OLLAMA_HOST environment variable.

Refer to the section above for how to set environment variables on your platform.
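For example, to listen on all interfaces when starting the server manually from a shell (for the installed service, set the variable via launchctl or systemd as described above):

OLLAMA_HOST=0.0.0.0 ollama serve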

How can I allow additional web origins to access Ollama?

Ollama allows cross-origin requests from 127.0.0.1 and 0.0.0.0 by default. Additional origins can be configured with OLLAMA_ORIGINS.

Refer to the section above for how to set environment variables on your platform.
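For example, to additionally allow requests from a hypothetical web app (the origin below is only a placeholder), when starting the server manually:

OLLAMA_ORIGINS="https://app.example.com" ollama serve

Multiple origins can be supplied as a comma-separated list.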

Where are models stored?

  • macOS: ~/.ollama/models
  • Linux: /usr/share/ollama/.ollama/models
  • Windows: C:\Users\<username>\.ollama\models

How do I set them to a different location?

If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.

Refer to the section above for how to set environment variables on your platform.
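For example, on Linux this could be an extra line in the systemd override created by systemctl edit ollama.service (the path below is only an illustration; the directory must be readable and writable by the user the service runs as):

[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"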

Does Ollama send my prompts and answers back to ollama.com?

No. Ollama runs locally, and conversation data does not leave your machine.

How can I use Ollama in Visual Studio Code?

There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of extensions & plugins at the bottom of the main repository readme.

How do I use Ollama behind a proxy?

Ollama is compatible with proxy servers if HTTP_PROXY or HTTPS_PROXY are configured. When using either variable, ensure it is set where ollama serve can access the value. When using HTTPS_PROXY, ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to set environment variables on your platform.
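For example, when starting the server manually from a shell (proxy.example.com is a placeholder for your proxy):

HTTPS_PROXY=https://proxy.example.com ollama serve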

How do I use Ollama behind a proxy in Docker?

The Ollama Docker container image can be configured to use a proxy by passing -e HTTPS_PROXY=https://proxy.example.com when starting the container.
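For example (a sketch of a typical invocation; adjust the volume and port mapping to your setup):

docker run -d -e HTTPS_PROXY=https://proxy.example.com -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama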

Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on macOS, Windows, and Linux, and Docker daemon with systemd.

Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.

FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates

Build and run this image:

docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca

How do I use Ollama with GPU acceleration in Docker?

The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the nvidia-container-toolkit. See ollama/ollama for more details.
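A typical invocation once the toolkit is installed looks like the following (a sketch; see ollama/ollama for the full setup steps):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama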

GPU acceleration is not available for Docker Desktop on macOS due to the lack of GPU passthrough and emulation.

Why is networking slow in WSL2 on Windows 10?

This can impact both installing Ollama and downloading models.

  1. Open Control Panel > Networking and Internet > View network status and tasks and click on Change adapter settings on the left panel.

  2. Find the vEthernet (WSL) adapter, right-click and select Properties.

  3. Click on Configure and open the Advanced tab.

  4. Search through the properties until you find Large Send Offload Version 2 (IPv4) and Large Send Offload Version 2 (IPv6), and disable both.

How can I pre-load a model to get faster response times?

If you are using the API you can preload a model by sending the Ollama server an empty request. This works with both the /api/generate and /api/chat API endpoints.

To preload the mistral model using the generate endpoint, use:

curl http://localhost:11434/api/generate -d '{"model": "mistral"}'

To use the chat completions endpoint, use:

curl http://localhost:11434/api/chat -d '{"model": "mistral"}'

How do I keep a model loaded in memory or make it unload immediately?

By default, models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the keep_alive parameter with either the /api/generate or /api/chat API endpoint to control how long the model is left in memory.

The keep_alive parameter can be set to:

  • a duration string (such as "10m" or "24h")
  • a number in seconds (such as 3600)
  • any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
  • '0', which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": -1}'

To unload the model and free up memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
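The same parameter is accepted by the chat endpoint, for example to keep a model resident for roughly a day:

curl http://localhost:11434/api/chat -d '{"model": "llama2", "keep_alive": "24h"}'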

Controlling which GPUs to use

By default, on Linux and Windows, Ollama will attempt to use NVIDIA or Radeon GPUs and will use all the GPUs it can find. You can limit which GPUs are used by setting the environment variable CUDA_VISIBLE_DEVICES for NVIDIA cards, or HIP_VISIBLE_DEVICES for Radeon GPUs, to a comma-delimited list of GPU IDs. You can see the list of devices with GPU tools such as nvidia-smi or rocminfo. You can set the variable to an invalid GPU ID (e.g., "-1") to bypass the GPUs and fall back to the CPU.
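For example, to restrict Ollama to the first two NVIDIA GPUs when launching the server manually (for the installed service, set the variable via systemd or launchctl as described above):

CUDA_VISIBLE_DEVICES=0,1 ollama serve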