cuda pytorch vulkan hpc

How do I make sure Vulkan is using the same GPU as CUDA?


I'm using an application that uses both Vulkan and CUDA (specifically PyTorch) on an HPC cluster (Univa Grid Engine).

When a job is submitted, the cluster scheduler sets an environment variable, SGE_HGR_gpu, which contains the ID of the GPU the job should use (so that jobs run by other users do not use the same GPU).

The typical way to tell a CUDA application to use a specific GPU is to set CUDA_VISIBLE_DEVICES=n.
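A minimal sketch of that step in Python, assuming SGE_HGR_gpu holds one or more whitespace-separated integer GPU IDs (the exact format depends on how the cluster is configured, so check it on your system). The helper name cuda_visible_devices_from_sge is hypothetical. The variable has to be set before CUDA is initialised, which in practice means before importing torch.

```python
import os


def cuda_visible_devices_from_sge(environ):
    """Build a CUDA_VISIBLE_DEVICES value from SGE_HGR_gpu.

    Assumption: SGE_HGR_gpu contains whitespace-separated GPU IDs,
    e.g. "2" or "0 3". Returns None if the variable is unset or empty.
    """
    raw = environ.get("SGE_HGR_gpu", "").strip()
    if not raw:
        return None
    # CUDA_VISIBLE_DEVICES expects a comma-separated list.
    return ",".join(raw.split())


value = cuda_visible_devices_from_sge(os.environ)
if value is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = value

# Import torch (or any other CUDA user) only after this point.
```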

As I'm also using Vulkan, I don't know how to make sure that I choose the same device from those listed by vkEnumeratePhysicalDevices.

I think that the order of the values that 'n' can take is the same as the order of the devices on the PCI bus. However, I don't know whether vkEnumeratePhysicalDevices returns the devices in that order, and the documentation does not specify what the order is.

So how can I make sure I'm choosing the same physical GPU for both Vulkan and CUDA?


Solution

  • With VkPhysicalDeviceIDPropertiesKHR (Vulkan 1.1) or VkPhysicalDeviceVulkan11Properties (Vulkan 1.2) you can get the device UUID (the deviceUUID member), which is one of the formats CUDA_VISIBLE_DEVICES seems to accept. You should also be able to convert an index to a UUID (or vice versa) with nvidia-smi -L (or with the NVML library).

    Or, the other way around, cudaDeviceProp includes PCI information, which can be compared with the output of the VK_EXT_pci_bus_info extension.

    As for whether the order matches in Vulkan, it is best to ask NVIDIA directly; I cannot find any information on how NVIDIA orders the devices. IIRC from the Vulkan Loader implementation, the order should follow the order of the ICDs in the registry, and then the order in which the NVIDIA driver itself reports its devices. Even so, generic code would have to filter non-NVIDIA GPUs out of the list, and you cannot know whether the NVIDIA Vulkan ICD's order matches CUDA's without asking NVIDIA.
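The UUID route above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes that the 16 bytes of deviceUUID reported by the NVIDIA Vulkan driver are the same bytes that nvidia-smi prints as "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", and that nvidia-smi -L prints lines of the form "GPU 0: <name> (UUID: GPU-...)". The function names are hypothetical, and obtaining the raw deviceUUID bytes from your Vulkan binding is left out.

```python
import re
import uuid


def vulkan_uuid_to_nvidia(device_uuid):
    """Format 16 raw deviceUUID bytes the way nvidia-smi prints UUIDs.

    Assumption: the NVIDIA Vulkan driver reports the same 16 bytes
    that nvidia-smi shows after the "GPU-" prefix.
    """
    raw = bytes(device_uuid)
    if len(raw) != 16:
        raise ValueError("deviceUUID must be exactly 16 bytes")
    return "GPU-" + str(uuid.UUID(bytes=raw))


# Assumed line format: "GPU 0: Tesla V100 (UUID: GPU-....)"
_GPU_LINE = re.compile(r"^GPU (\d+): .*\(UUID: (GPU-[0-9a-fA-F-]+)\)")


def parse_nvidia_smi_list(text):
    """Map nvidia-smi GPU index -> UUID string from `nvidia-smi -L` output."""
    result = {}
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            result[int(match.group(1))] = match.group(2)
    return result


def pick_vulkan_device(vulkan_uuids, target_uuid):
    """Return the index into the vkEnumeratePhysicalDevices list whose
    deviceUUID matches target_uuid, or None if there is no match."""
    wanted = target_uuid.lower()
    for index, raw in enumerate(vulkan_uuids):
        if vulkan_uuid_to_nvidia(raw).lower() == wanted:
            return index
    return None
```

In use, you would feed parse_nvidia_smi_list the output of nvidia-smi -L (for example via subprocess.run), look up the UUID for the scheduler-assigned index, pass that UUID string to CUDA_VISIBLE_DEVICES, and use pick_vulkan_device on the deviceUUID values you read for each Vulkan physical device. Matching by UUID avoids relying on any enumeration order on the Vulkan side; whether the scheduler's GPU ID corresponds to the nvidia-smi index is a cluster-specific assumption you would need to verify.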