This example shows how to do LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

First, install the requirements:

```bash
$ pip install -r requirements.txt
```
Deploy a GGUF model to Ray Serve with the following command:

```bash
$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```
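The `llm.py` application itself isn't reproduced in this README. As a rough sketch, a builder like `llm_builder` could look like the following, assuming `llama-cpp-python` is used to load the GGUF file (the `LLM` deployment class and its internals are illustrative, not the repository's actual code):

```python
# llm.py -- illustrative sketch only; assumes llama-cpp-python.
from ray import serve
from starlette.requests import Request


@serve.deployment
class LLM:
    def __init__(self, model_path: str):
        # Import inside the replica so only Serve workers need llama-cpp.
        from llama_cpp import Llama

        self.llm = Llama(model_path=model_path)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        # Run a completion; llama-cpp returns an OpenAI-style response dict.
        return self.llm(
            body["prompt"],
            max_tokens=body.get("max_tokens", 128),
        )


def llm_builder(args: dict) -> serve.Application:
    # `serve run llm:llm_builder model_path=...` passes the key=value
    # CLI arguments to this builder function as a dict.
    return LLM.bind(model_path=args["model_path"])
```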
This will start an API endpoint at `http://localhost:8000/`. You can query the model like this:

```bash
$ curl -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000
```
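The same request can also be made from Python; here is an equivalent of the curl command above using the `requests` library (any HTTP client works):

```python
# Python equivalent of the curl command above.
import requests

response = requests.post(
    "http://localhost:8000/",
    json={"prompt": "tell me a joke", "max_tokens": 128},
)
print(response.json())
```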