Update examples from ggml to gguf and add hw-accel note for Web Server (#688)
* Examples from ggml to gguf

* Use gguf file extension

Update examples to use filenames with gguf extension (e.g. llama-model.gguf).

---------

Co-authored-by: Andrei <abetlen@gmail.com>
parent aa2f8a5008
commit 40b22909dc
1 changed file with 14 additions and 7 deletions
21
README.md
@@ -106,14 +106,14 @@ Below is a short example demonstrating how to use the high-level API to generate
 
 ```python
 >>> from llama_cpp import Llama
->>> llm = Llama(model_path="./models/7B/ggml-model.bin")
+>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
 >>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
 >>> print(output)
 {
 "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
 "object": "text_completion",
 "created": 1679561337,
-"model": "./models/7B/ggml-model.bin",
+"model": "./models/7B/llama-model.gguf",
 "choices": [
 {
 "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
@@ -136,7 +136,7 @@ The context window of the Llama models determines the maximum number of tokens t
 For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:
 
 ```python
-llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
+llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
 ```
 
 ### Loading llama-2 70b
@@ -144,7 +144,7 @@ llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
 Llama2 70b must set the `n_gqa` parameter (grouped-query attention factor) to 8 when loading:
 
 ```python
-llm = Llama(model_path="./models/70B/ggml-model.bin", n_gqa=8)
+llm = Llama(model_path="./models/70B/llama-model.gguf", n_gqa=8)
 ```
 
 ## Web Server
@@ -156,17 +156,24 @@ To install the server package and get started:
 
 ```bash
 pip install llama-cpp-python[server]
-python3 -m llama_cpp.server --model models/7B/ggml-model.bin
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf
+```
+Similar to Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:
+
+```bash
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
 ```
 
 Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.
 
+
 ## Docker image
 
 A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:
 
 ```bash
-docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
+docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
 ```
 [Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)
 
@@ -183,7 +190,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 >>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
 >>> params = llama_cpp.llama_context_default_params()
 # use bytes for char * params
->>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/ggml-model.bin", params)
+>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
 >>> ctx = llama_cpp.llama_new_context_with_model(model, params)
 >>> max_tokens = params.n_ctx
 # use ctypes arrays for array params
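
Note that this commit only changes the example filenames; renaming an existing `.bin` file to `.gguf` does not convert it. As a rough sanity check, the sketch below tests whether a file starts with the four ASCII bytes `GGUF`, which is the magic at the start of a GGUF file. `looks_like_gguf` is a hypothetical helper written for illustration, not part of llama-cpp-python, and it inspects only the magic, not the version or tensor data.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes.

    Hypothetical helper for illustration only: it does not validate the
    format version, metadata, or tensors that follow the magic.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

A check like this could run before passing `model_path` to `Llama(...)`, so that a mislabelled file is reported early rather than at load time.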
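
Once the server from the Web Server hunk is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server listens on `http://localhost:8000` (the address used for the `/docs` link) and exposes an OpenAI-style `/v1/completions` route; check the OpenAPI page to confirm the route and fields for your version. `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=32):
    """Build (but do not send) a POST request for the completions route.

    The route and payload fields are assumptions based on the
    OpenAI-compatible API; verify them against the server's /docs page.
    """
    payload = {"prompt": prompt, "max_tokens": max_tokens, "stop": ["Q:", "\n"]}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server started as shown in the diff, `complete("Q: Name the planets in the solar system? A: ")` would return the generated text; without a running server, `urlopen` raises `URLError`.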