I'm trying to figure out the link topology between GPUs, basically to do pretty much the same thing nvidia-smi topo -m does.
I've found the CUDA sample topologyQuery, which essentially calls
cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank, device1, device2)
for each pair of GPUs.
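In outline, that per-pair query looks roughly like this (a minimal sketch, not the sample's exact code; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Prints a rank for every ordered GPU pair, without checking
    // whether P2P is supported between the two devices.
    for (int device1 = 0; device1 < deviceCount; device1++) {
        for (int device2 = 0; device2 < deviceCount; device2++) {
            if (device1 == device2) continue;
            int perfRank = 0;
            cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank,
                                      device1, device2);
            printf("GPU%d <-> GPU%d: perf rank %d\n", device1, device2, perfRank);
        }
    }
    return 0;
}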
The results of running this example (I modified the output presentation a bit) are very confusing to me when compared with the output of nvidia-smi topo -m on the same machine:
$ ./topologyQuery
X 1 1 0 0 0 0 0
1 X 0 1 0 0 0 0
1 0 X 0 0 0 1 0
0 1 0 X 0 0 0 1
0 0 0 0 X 1 1 0
0 0 0 0 1 X 0 1
0 0 1 0 1 0 X 0
0 0 0 1 0 1 0 X
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X NV1 NV1 NV2 NV2 PHB PHB PHB 0-95
GPU1 NV1 X NV2 NV1 PHB NV2 PHB PHB 0-95
GPU2 NV1 NV2 X NV2 PHB PHB NV1 PHB 0-95
GPU3 NV2 NV1 NV2 X PHB PHB PHB NV1 0-95
GPU4 NV2 PHB PHB PHB X NV1 NV1 NV2 0-95
GPU5 PHB NV2 PHB PHB NV1 X NV2 NV1 0-95
GPU6 PHB PHB NV1 PHB NV1 NV2 X NV2 0-95
GPU7 PHB PHB PHB NV1 NV2 NV1 NV2 X 0-95
From https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html:
cudaDevP2PAttrPerformanceRank: A relative value indicating the performance of the link between two devices. Lower value means better performance (0 being the value used for most performant link).
Why did NV1 get rank 1? Why did PHB get rank 0?
Am I misunderstanding the purpose of the cudaDevP2PAttrPerformanceRank query?
I don't know exactly what kind of system you are testing on, but the output looks approximately like that of a DGX-1.
With respect to this question:
Why did PHB get rank 0?
If you run the original topologyQuery sample code, you'll see (at least on DGX-1-like systems) that it does not print a performance rank for every GPU pair. From what I can see, it does not print a performance rank for the pairs where PHB is indicated. If you study the original code, the reason is clear: P2P is not supported for those pair combinations. Your code, however, seems to print a zero in these cases. I would say that is a defect in your code compared to the original topologyQuery code, and it is what is leading to this question and your misunderstanding. PHB did not get assigned rank 0 by the original code; your modified code does that, so that one is for you to answer.
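A minimal sketch of the kind of guard the original sample applies (assuming it gates its output on cudaDevP2PAttrAccessSupported; check the actual sample source for the exact logic) would be:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int device1 = 0; device1 < deviceCount; device1++) {
        for (int device2 = 0; device2 < deviceCount; device2++) {
            if (device1 == device2) continue;

            int accessSupported = 0, perfRank = 0;
            cudaDeviceGetP2PAttribute(&accessSupported, cudaDevP2PAttrAccessSupported,
                                      device1, device2);
            cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank,
                                      device1, device2);

            // Only report a rank when P2P access is actually supported between the pair.
            // On a DGX-1 the PHB pairs come back unsupported, so no rank is printed for them.
            if (accessSupported)
                printf("GPU%d <-> GPU%d: perf rank %d\n", device1, device2, perfRank);
        }
    }
    return 0;
}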
Why did NV1 get rank 1?
With respect to the remainder: an NV2 connection implies a dual-link NVLink connection between those two GPUs (50 GB/s per direction). This is the most performant kind of link in that particular system, so it is assigned a performance rank of 0. An NV1 connection implies a single-link NVLink connection (25 GB/s per direction). This is less performant than NV2, so it is assigned a performance rank of 1. Increasing rank values indicate decreasing link performance.
As an aside, if your intent is to do pretty much the same thing nvidia-smi topo -m does, you won't be able to do that strictly with CUDA API calls.
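nvidia-smi is built on NVML, so that is the place to look instead. As a rough sketch of a possible starting point (my assumption, not what nvidia-smi actually does internally; this only reports the PCIe-path level such as host bridge or NUMA node, not the NV# NVLink counts, which need NVML's separate NVLink queries):

#include <cstdio>
#include <nvml.h>

int main() {
    // Link against NVML, e.g. g++ topo.cpp -lnvidia-ml
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; i++) {
        for (unsigned int j = i + 1; j < count; j++) {
            nvmlDevice_t d1, d2;
            nvmlGpuTopologyLevel_t level;
            nvmlDeviceGetHandleByIndex(i, &d1);
            nvmlDeviceGetHandleByIndex(j, &d2);
            // Deepest common PCIe ancestor: single/multiple switch, host bridge, NUMA node, or system.
            if (nvmlDeviceGetTopologyCommonAncestor(d1, d2, &level) == NVML_SUCCESS)
                printf("GPU%u <-> GPU%u: topology level %d\n", i, j, (int)level);
        }
    }

    nvmlShutdown();
    return 0;
}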
For reference, here is the nvidia-smi topo -m
output and ./topologyQuery
output for a DGX-1:
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X NV1 NV1 NV2 NV2 PHB PHB PHB 0-79
GPU1 NV1 X NV2 NV1 PHB NV2 PHB PHB 0-79
GPU2 NV1 NV2 X NV2 PHB PHB NV1 PHB 0-79
GPU3 NV2 NV1 NV2 X PHB PHB PHB NV1 0-79
GPU4 NV2 PHB PHB PHB X NV1 NV1 NV2 0-79
GPU5 PHB NV2 PHB PHB NV1 X NV2 NV1 0-79
GPU6 PHB PHB NV1 PHB NV1 NV2 X NV2 0-79
GPU7 PHB PHB PHB NV1 NV2 NV1 NV2 X 0-79
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
# ./topologyQuery
GPU0 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 1
GPU0 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 1
GPU0 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 0
GPU0 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 0
GPU1 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 1
GPU1 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 0
GPU1 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 1
GPU1 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 0
GPU2 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 1
GPU2 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 0
GPU2 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 0
GPU2 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 1
GPU3 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 0
GPU3 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 1
GPU3 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 0
GPU3 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 1
GPU4 <-> GPU0:
* Atomic Supported: yes
* Perf Rank: 0
GPU4 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 1
GPU4 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 1
GPU4 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 0
GPU5 <-> GPU1:
* Atomic Supported: yes
* Perf Rank: 0
GPU5 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 1
GPU5 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 0
GPU5 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 1
GPU6 <-> GPU2:
* Atomic Supported: yes
* Perf Rank: 1
GPU6 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 1
GPU6 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 0
GPU6 <-> GPU7:
* Atomic Supported: yes
* Perf Rank: 0
GPU7 <-> GPU3:
* Atomic Supported: yes
* Perf Rank: 1
GPU7 <-> GPU4:
* Atomic Supported: yes
* Perf Rank: 0
GPU7 <-> GPU5:
* Atomic Supported: yes
* Perf Rank: 1
GPU7 <-> GPU6:
* Atomic Supported: yes
* Perf Rank: 0
GPU0 <-> CPU:
* Atomic Supported: no
GPU1 <-> CPU:
* Atomic Supported: no
GPU2 <-> CPU:
* Atomic Supported: no
GPU3 <-> CPU:
* Atomic Supported: no
GPU4 <-> CPU:
* Atomic Supported: no
GPU5 <-> CPU:
* Atomic Supported: no
GPU6 <-> CPU:
* Atomic Supported: no
GPU7 <-> CPU:
* Atomic Supported: no