Jesse Gross 26acdcf44e 2024-10-31 13:56:08 -07:00
runner.go: Don't set cross attention before sending embeddings

Currently if an input has embeddings at any point then we will set cross attention to true from the beginning. This means that any tokens before the embeddings are sent will incorrectly have cross attention layers applied.

This only sets cross attention when we have an embedding, either previously in this sequence or in the cache. It also makes cross attention capable of supporting parallelism at the runner level, though the mllama implementation doesn't support that yet.

cache.go	runner.go: Better abstract vision model integration	2024-10-30 14:53:43 -07:00
cache_test.go	runner.go: Better abstract vision model integration	2024-10-30 14:53:43 -07:00
image.go	runner.go: Don't set cross attention before sending embeddings	2024-10-31 13:56:08 -07:00
image_test.go	runner.go: Better abstract vision model integration	2024-10-30 14:53:43 -07:00
README.md	Re-introduce the `llama` package (#5034)	2024-10-08 08:53:54 -07:00
requirements.go	Re-introduce the `llama` package (#5034)	2024-10-08 08:53:54 -07:00
runner.go	runner.go: Don't set cross attention before sending embeddings	2024-10-31 13:56:08 -07:00
stop.go	runner.go: Handle truncation of tokens for stop sequences	2024-10-09 20:39:04 -07:00
stop_test.go	runner.go: Handle truncation of tokens for stop sequences	2024-10-09 20:39:04 -07:00

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings