
This example shows how to run LLM inference with Ray and Ray Serve.

First, install the requirements:

$ pip install -r requirements.txt
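
The requirements file is not reproduced in this README. Based on the commands below, it plausibly contains just Ray Serve and llama-cpp-python; this is an assumption about its contents, not a copy of the file:

```
# assumed contents of requirements.txt -- pin versions as needed
ray[serve]
llama-cpp-python
```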

Deploy a GGUF model to Ray Serve with the following command:

$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
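
Here `llm:llm_builder` names the `llm_builder` function in `llm.py`, and `model_path=...` is forwarded to it by Ray Serve's application-builder mechanism. The file itself is not shown in this README; the following is a minimal sketch of what such an entrypoint could look like, assuming llama-cpp-python as the backend and FastAPI for HTTP ingress (the class name and handler details are illustrative):

```python
# llm.py -- illustrative sketch, not the example's verbatim source.
from fastapi import FastAPI, Request
from ray import serve
from llama_cpp import Llama

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class LLMDeployment:
    def __init__(self, model_path: str):
        # Load the GGUF model once per replica.
        self._llm = Llama(model_path=model_path)

    @app.post("/")
    async def generate(self, request: Request) -> dict:
        body = await request.json()
        # llama-cpp-python's __call__ returns an OpenAI-style completion dict.
        return self._llm(body["prompt"], max_tokens=body.get("max_tokens", 128))


def llm_builder(args: dict) -> serve.Application:
    # `serve run llm:llm_builder model_path=...` passes the CLI key=value
    # pairs to this builder as a dict and deploys the returned application.
    return LLMDeployment.bind(args["model_path"])
```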

The `serve run` command starts an API endpoint at http://localhost:8000/. You can query the model like this:

$ curl -k -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000
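
The same request can be made from Python. This client is a hypothetical helper, not part of the example, and it assumes the server returns an OpenAI-style completion object (as llama-cpp-python does):

```python
# query.py -- illustrative client for the endpoint started above.
import json
import urllib.request

payload = json.dumps({"prompt": "tell me a joke", "max_tokens": 128}).encode()
req = urllib.request.Request(
    "http://localhost:8000/",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    completion = json.load(resp)

# In an OpenAI-style completion object, the generated text
# lives under choices[0]["text"].
print(completion["choices"][0]["text"])
```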