llama.cpp/examples/ray/README.md

This is an example of doing LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

First, install the requirements:

```bash
$ pip install -r requirements.txt
```

Deploy a GGUF model to Ray Serve with the following command:

```bash
$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```

This will start an API endpoint at `http://localhost:8000/`. You can query the model like this:

```bash
$ curl -k -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000
```
example: LLM inference with Ray Serve (#1465) 2024-05-17 17:27:26 +00:00			`This is an example of doing LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).`

			`First, install the requirements:`

			```bash
			`$ pip install -r requirements.txt`
			```

			`Deploy a GGUF model to Ray Serve with the following command:`

			```bash
			`$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'`
			```

			This will start an API endpoint at `http://localhost:8000/`. You can query the model like this:

			```bash
			`$ curl -k -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000`
			```