Michael Yang
3cf483fe48
add stablelm graph calculation
2024-04-17 13:57:19 -07:00
Michael Yang
a8b9b930b4
account for all non-repeating layers
2024-04-17 11:21:21 -07:00
Michael Yang
26df674785
scale graph based on gpu count
2024-04-16 14:44:13 -07:00
Michael Yang
41a272de9f
darwin: no partial offloading if required memory greater than system
2024-04-16 11:22:38 -07:00
Jeffrey Morgan
a0b8a32eb4
Terminate subprocess if receiving SIGINT
or SIGTERM
signals while model is loading ( #3653 )
...
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading
* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
Michael Yang
7e33a017c0
partial offloading
2024-04-10 11:37:20 -07:00
Michael Yang
8b2c10061c
refactor tensor query
2024-04-10 11:37:20 -07:00
Daniel Hiltgen
c5ff443b9f
Handle very slow model loads
...
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
Michael Yang
be517e491c
no rope parameters
2024-04-05 18:05:27 -07:00
Michael Yang
12e923e158
update graph size estimate
2024-04-03 13:34:12 -07:00
Daniel Hiltgen
464d817824
Merge pull request #3464 from dhiltgen/subprocess
...
Fix numgpu opt miscomparison
2024-04-02 20:10:17 -07:00
Daniel Hiltgen
6589eb8a8c
Revert options as a ref in the server
2024-04-02 16:44:10 -07:00
Michael Yang
80163ebcb5
fix metal gpu
2024-04-02 16:06:45 -07:00
Daniel Hiltgen
58d95cc9bd
Switch back to subprocessing for llama.cpp
...
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00