c826e57475
-Update mllama to take the cross attention state as embeddings in a batch, more similar to how Llava handles it. This improves integration with the input cache. -Pass locations in a prompt for embeddings using tags similar to Llava. -Abstract interface to vision models so the main runner accesses Clip and Mllama similarly Co-authored-by: Michael Yang <mxyng@pm.me> |
||
---|---|---|
.. | ||
cache.go | ||
cache_test.go | ||
image.go | ||
image_test.go | ||
README.md | ||
requirements.go | ||
runner.go | ||
stop.go | ||
stop_test.go |
runner
Note: this is a work in progress
A minimial runner for loading a model and running inference via a http web server.
./runner -model <model binary>
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings