Commit graph

5 commits

Author SHA1 Message Date
Daniel Hiltgen
b111aa5a91
Debug logging for nvcuda init (#7532)
Some users are reporting crashes during nvcuda.dll initialization
on windows.  This should help narrow down where things are going bad.
2024-11-07 14:25:53 -08:00
Daniel Hiltgen
29ab9fa7d7
nvidia libs have inconsistent ordering (#7473)
The runtime and management libraries may not always have
identical ordering, so use the device UUID to correlate instead of ID.
2024-11-02 16:35:41 -07:00
Daniel Hiltgen
16f4eabe2d
Refine default thread selection for NUMA systems (#7322)
Until we have full NUMA support, this adjusts the default thread selection
algorithm to count up the number of performance cores across all sockets.
2024-10-30 15:05:45 -07:00
Daniel Hiltgen
d7c94e0ca6
Better support for AMD multi-GPU on linux (#7212)
* Better support for AMD multi-GPU

This resolves a number of problems related to AMD multi-GPU setups on linux.

The numeric IDs used by rocm are not the same as the numeric IDs exposed in
sysfs although the ordering is consistent.  We have to count up from the first
valid gfx (major/minor/patch with non-zero values) we find starting at zero.

There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
supports UUID based identification, so we should favor that one, and try
to use UUIDs if detected to avoid potential ordering bugs with numeric IDs

* ROCR_VISIBLE_DEVICES only works on linux

Use the numeric ID only HIP_VISIBLE_DEVICES on windows
2024-10-26 14:04:14 -07:00
Daniel Hiltgen
05cd82ef94
Rename gpu package discover (#7143)
Cleaning up go package naming
2024-10-16 17:45:00 -07:00