I'm trying to figure out the link topology between GPUs, basically to do pretty much the same thing nvidia-smi topo -m does.
I've found the CUDA sample topologyQuery, which essentially calls
cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank, device1, device2)
for each pair of GPUs.
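In outline, that per-pair query looks roughly like this (a minimal sketch, not the sample's exact code; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Prints a rank for every ordered GPU pair, without checking
    // whether P2P is supported between the two devices.
    for (int device1 = 0; device1 < deviceCount; device1++) {
        for (int device2 = 0; device2 < deviceCount; device2++) {
            if (device1 == device2) continue;
            int perfRank = 0;
            cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank,
                                      device1, device2);
            printf("GPU%d <-> GPU%d: perf rank %d\n", device1, device2, perfRank);
        }
    }
    return 0;
}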
The results of running this example (I modified the output presentation a bit) are very confusing to me when compared with the output of nvidia-smi topo -m on the same machine:
$ ./topologyQuery
X 1 1 0 0 0 0 0
1 X 0 1 0 0 0 0
1 0 X 0 0 0 1 0
0 1 0 X 0 0 0 1
0 0 0 0 X 1 1 0
0 0 0 0 1 X 0 1
0 0 1 0 1 0 X 0
0 0 0 1 0 1 0 X
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X NV1 NV1 NV2 NV2 PHB PHB PHB 0-95
GPU1 NV1 X NV2 NV1 PHB NV2 PHB PHB 0-95
GPU2 NV1 NV2 X NV2 PHB PHB NV1 PHB 0-95
GPU3 NV2 NV1 NV2 X PHB PHB PHB NV1 0-95
GPU4 NV2 PHB PHB PHB X NV1 NV1 NV2 0-95
GPU5 PHB NV2 PHB PHB NV1 X NV2 NV1 0-95
GPU6 PHB PHB NV1 PHB NV1 NV2 X NV2 0-95
GPU7 PHB PHB PHB NV1 NV2 NV1 NV2 X 0-95
From https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html:
cudaDevP2PAttrPerformanceRank: A relative value indicating the performance of the link between two devices. Lower value means better performance (0 being the value used for most performant link).
Why did NV1 get rank 1? Why did PHB get rank 0?
Am I misunderstanding the purpose of the cudaDevP2PAttrPerformanceRank query?
I don't know exactly what kind of system you are testing on, but the output looks approximately like that of a DGX-1.
With respect to this question:
Why did PHB get rank 0?
If you run the original topologyQuery sample code, you'll see (at least on DGX-1-like systems) that it does not print a performance rank for every GPU pair. From what I can see, it does not print a performance rank for the pairs where PHB is indicated. If you study the original code, the reason is clear: P2P is not supported for those pair combinations. Your code, however, seems to print a zero in these cases. I would say that is a defect in your code compared to the original topologyQuery code, and it is what is leading to this question and your misunderstanding. PHB did not get assigned rank 0 by the original code; your modified code does that, so that one is for you to answer.
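A minimal sketch of the kind of guard the original sample applies (assuming it gates its output on cudaDevP2PAttrAccessSupported; check the actual sample source for the exact logic) would be:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int device1 = 0; device1 < deviceCount; device1++) {
        for (int device2 = 0; device2 < deviceCount; device2++) {
            if (device1 == device2) continue;

            int accessSupported = 0, perfRank = 0;
            cudaDeviceGetP2PAttribute(&accessSupported, cudaDevP2PAttrAccessSupported,
                                      device1, device2);
            cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank,
                                      device1, device2);

            // Only report a rank when P2P access is actually supported between the pair.
            // On a DGX-1 the PHB pairs come back unsupported, so no rank is printed for them.
            if (accessSupported)
                printf("GPU%d <-> GPU%d: perf rank %d\n", device1, device2, perfRank);
        }
    }
    return 0;
}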
Why did NV1 get rank 1?
With respect to the remainder: an NV2 connection implies a dual-link NVLink connection between those two GPUs (50 GB/s per direction). This is the most performant kind of link in that particular system, so it is assigned a performance rank of 0. An NV1 connection implies a single-link NVLink connection (25 GB/s per direction). This is less performant than NV2, so it is assigned a performance rank of 1. Increasing rank values indicate decreasing link performance.
As an aside, if your intent is to do pretty much the same thing nvidia-smi topo -m does, you won't be able to do that strictly with CUDA API calls.
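nvidia-smi is built on NVML, so that is the place to look instead. As a rough sketch of a possible starting point (my assumption, not what nvidia-smi actually does internally; this only reports the PCIe-path level such as host bridge or NUMA node, not the NV# NVLink counts, which need NVML's separate NVLink queries):

#include <cstdio>
#include <nvml.h>

int main() {
    // Link against NVML, e.g. g++ topo.cpp -lnvidia-ml
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; i++) {
        for (unsigned int j = i + 1; j < count; j++) {
            nvmlDevice_t d1, d2;
            nvmlGpuTopologyLevel_t level;
            nvmlDeviceGetHandleByIndex(i, &d1);
            nvmlDeviceGetHandleByIndex(j, &d2);
            // Deepest common PCIe ancestor: single/multiple switch, host bridge, NUMA node, or system.
            if (nvmlDeviceGetTopologyCommonAncestor(d1, d2, &level) == NVML_SUCCESS)
                printf("GPU%u <-> GPU%u: topology level %d\n", i, j, (int)level);
        }
    }

    nvmlShutdown();
    return 0;
}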
For reference, here is the nvidia-smi topo -m
output and ./topologyQuery
output for a DGX-1:
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X NV1 NV1 NV2 NV2 PHB PHB PHB 0-79
GPU1 NV1 X NV2 NV1 PHB NV2 PHB PHB 0-79
GPU2 NV1 NV2 X NV2 PHB PHB NV1 PHB 0-79
GPU3 NV2 NV1 NV2 X PHB PHB PHB NV1 0-79
GPU4 NV2 PHB PHB PHB X NV1 NV1 NV2 0-79
GPU5 PHB NV2 PHB PHB NV1 X NV2 NV1 0-79
GPU6 PHB PHB NV1 PHB NV1 NV2 X NV2 0-79
GPU7 PHB PHB PHB NV1 NV2 NV1 NV2 X 0-79
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
# ./topologyQuery
GPU0 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 1
GPU0 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 1
GPU0 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 0
GPU0 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 0
GPU1 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 1
GPU1 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 0
GPU1 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 1
GPU1 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 0
GPU2 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 1
GPU2 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 0
GPU2 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 0
GPU2 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 1
GPU3 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 0
GPU3 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 1
GPU3 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 0
GPU3 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 1
GPU4 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 0
GPU4 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 1
GPU4 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 1
GPU4 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 0
GPU5 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 0
GPU5 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 1
GPU5 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 0
GPU5 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 1
GPU6 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 1
GPU6 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 1
GPU6 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 0
GPU6 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 0
GPU7 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 1
GPU7 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 0
GPU7 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 1
GPU7 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 0
GPU0 <-> CPU:
* Atomic Supported: no
GPU1 <-> CPU:
* Atomic Supported: no
GPU2 <-> CPU:
* Atomic Supported: no
GPU3 <-> CPU:
* Atomic Supported: no
GPU4 <-> CPU:
* Atomic Supported: no
GPU5 <-> CPU:
* Atomic Supported: no
GPU6 <-> CPU:
* Atomic Supported: no
GPU7 <-> CPU:
* Atomic Supported: no