* Give unicode test more time to run
Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test.
* Give more time for concurrency test
CPU inference can be very slow under stress.
We check for partial unicode characters and accumulate them before
sending. However, when we did send, we still sent each individual piece
separately, leading to broken output. This combines everything into
a single group, which is also more efficient.
This also switches to the built-in check for valid unicode characters,
which is stricter. After this, we should never send back an invalid
sequence.
Fixes #7290
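The accumulate-then-send behavior described in the unicode fix can be sketched roughly as follows. This is an illustrative sketch, not the actual runner code: the `pendingText` type and its `add` method are hypothetical names, and the only real library call is `utf8.ValidString` from Go's standard `unicode/utf8` package, standing in for the "built-in check".

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// pendingText (hypothetical) buffers generated pieces until the buffered
// bytes form valid UTF-8.
type pendingText struct {
	buf strings.Builder
}

// add appends a piece to the buffer. If the buffer is now valid UTF-8, it
// returns everything accumulated so far as ONE string and resets the buffer.
// Otherwise it returns ok == false and keeps buffering.
func (p *pendingText) add(piece string) (out string, ok bool) {
	p.buf.WriteString(piece)
	if !utf8.ValidString(p.buf.String()) {
		// Likely a multi-byte character split across pieces; wait for more.
		return "", false
	}
	out = p.buf.String()
	p.buf.Reset()
	return out, true
}
```

For example, feeding the two bytes of "é" (`"\xc3"` then `"\xa9"`) as separate pieces yields nothing on the first call and the complete character on the second. One caveat of this simplified version: a byte sequence that never becomes valid would stay buffered indefinitely, so a real implementation would need some bound or fallback.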
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The defaults are
currently 1 concurrent request per model and only 1 loaded model at a
time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and
OLLAMA_MAX_LOADED_MODELS.