Do integrated GPUs in CPUs have the overhead of transferring data over the PCIe bus just like transferring data between CPU and dedicated GPU?

I ask because my OpenCL GPU-accelerated computation performs better on the integrated Intel(R) Iris(R) Xe Graphics GPU than on the dedicated NVIDIA T500 GPU for very large datasets, and I'm trying to understand why. With small datasets, however, the NVIDIA GPU is faster.

The performance tests:

N is also the number of threads!

OpenCL GPU Intel(R) Iris(R) Xe Graphics

PerformanceTestOpenCLAccelerationCalculation, N100

Processing took: 0.0016272 seconds for N = 100
Processing took: 0.0014396 seconds for N = 100
Processing took: 0.0012167 seconds for N = 100
Processing took: 0.0017356 seconds for N = 100
Processing took: 0.0011649 seconds for N = 100

PerformanceTestOpenCLAccelerationCalculation, N1_000

Processing took: 0.0026064 seconds for N = 1000
Processing took: 0.0023835 seconds for N = 1000
Processing took: 0.0024709 seconds for N = 1000
Processing took: 0.002758 seconds for N = 1000
Processing took: 0.0028464 seconds for N = 1000

PerformanceTestOpenCLAccelerationCalculation, N10_000

Processing took: 0.0207773 seconds for N = 10000
Processing took: 0.0205991 seconds for N = 10000
Processing took: 0.0204573 seconds for N = 10000
Processing took: 0.0213179 seconds for N = 10000
Processing took: 0.0213915 seconds for N = 10000

PerformanceTestOpenCLAccelerationCalculation, N100_000

Processing took: 1.2154342 seconds for N = 100000
Processing took: 1.2097847 seconds for N = 100000
Processing took: 1.214667 seconds for N = 100000
Processing took: 1.2179605 seconds for N = 100000
Processing took: 1.2161523 seconds for N = 100000

PerformanceTestOpenCLAccelerationCalculation, N1_000_000

Processing took: 3.0097172 seconds for N = 1000000
Processing took: 7.0101114 seconds for N = 1000000
Processing took: 3.5101502 seconds for N = 1000000
Processing took: 3.0101302 seconds for N = 1000000
Processing took: 7.0098925 seconds for N = 1000000

OpenCL NVIDIA T500 GPU

PerformanceTestOpenCLAccelerationCalculation, N100

Processing took: 0.0008086 seconds for N = 100
Processing took: 0.0007528 seconds for N = 100
Processing took: 0.000913 seconds for N = 100
Processing took: 0.0008781 seconds for N = 100
Processing took: 0.0007748 seconds for N = 100

PerformanceTestOpenCLAccelerationCalculation, N1_000

Processing took: 0.0009754 seconds for N = 1000
Processing took: 0.0009548 seconds for N = 1000
Processing took: 0.0010413 seconds for N = 1000
Processing took: 0.0009378 seconds for N = 1000
Processing took: 0.0009792 seconds for N = 1000

PerformanceTestOpenCLAccelerationCalculation, N10_000

Processing took: 0.0048292 seconds for N = 10000
Processing took: 0.0049848 seconds for N = 10000
Processing took: 0.0048261 seconds for N = 10000
Processing took: 0.0048766 seconds for N = 10000
Processing took: 0.3641353 seconds for N = 100000

PerformanceTestOpenCLAccelerationCalculation, N100_000

Processing took: 0.3648832 seconds for N = 100000
Processing took: 0.364661 seconds for N = 100000
Processing took: 0.3648172 seconds for N = 10000
Processing took: 0.3643978 seconds for N = 100000
Processing took: 0.3636657 seconds for N = 100000

PerformanceTestOpenCLAccelerationCalculation, N1_000_000

Processing took: 36.6879734 seconds for N = 1000000
Processing took: 37.7439221 seconds for N = 1000000
Processing took: 37.7102053 seconds for N = 1000000
Processing took: 37.7474797 seconds for N = 1000000
Processing took: 37.4340912 seconds for N = 1000000

Solution

The integrated GPU uses part of the CPU's system memory. Data is not moved over PCIe, which makes host<->device copies a lot faster. Copying may even be avoided entirely when the iGPU works directly on the host pointer instead of a separate copy in system memory. If your application relies heavily on host<->device copies, this might explain the performance difference. Try to avoid data movement as much as possible.
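To get a feel for how much the copies can cost on a dedicated GPU, here is a rough back-of-the-envelope sketch. It only models one copy as a fixed latency plus bytes divided by bandwidth. The 4 GB/s bandwidth, the float32 buffer of N elements, and the "copy in and out on every kernel call" pattern are all assumptions for illustration, not measurements of the T500 or of your code.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Rough time for one host<->device copy: fixed latency + bytes / bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + num_bytes / bandwidth_bytes_per_s


# Hypothetical workload: one float32 buffer with N elements that is copied
# to the device and back again on every kernel call.
n = 1_000_000
buffer_bytes = n * 4  # 4 bytes per float32

# Assumed effective bandwidth of the PCIe link (NOT measured).
ASSUMED_PCIE_BANDWIDTH = 4e9  # bytes per second

# One copy in, one copy out.
per_call = 2 * transfer_seconds(buffer_bytes, ASSUMED_PCIE_BANDWIDTH)
print(f"estimated copy cost per kernel call: {per_call * 1e3:.3f} ms")
```

Multiply that per-call figure by the number of kernel launches to see whether copies could plausibly dominate your runtime. On an iGPU that works on the host pointer, this term can drop to roughly zero.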

But looking at your performance numbers, that explanation alone seems unlikely. The T500 does well up to N = 100000, but then performance suddenly tanks. To me this indicates that you run out of GPU VRAM (only 2 GB), after which the data spills into system memory and has to be copied over PCIe for every single kernel call, which makes it suddenly very slow. Check the VRAM usage.
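Before reaching for a monitoring tool, a quick estimate of the buffer footprint can tell you whether spilling is plausible at all. The sketch below compares a footprint against the 2 GB figure mentioned above. The 4096 bytes per work-item is a purely hypothetical number; substitute the sum of your real buffer sizes. Keep in mind that the usable VRAM is usually somewhat less than the nominal capacity.

```python
VRAM_BYTES = 2 * 1024**3  # the 2 GB figure quoted above


def footprint_bytes(n_work_items, bytes_per_work_item):
    """Total device memory needed if every work-item owns a fixed number of bytes."""
    return n_work_items * bytes_per_work_item


def fits_in_vram(n_work_items, bytes_per_work_item, vram_bytes=VRAM_BYTES):
    """True if the estimated footprint does not exceed the given capacity."""
    return footprint_bytes(n_work_items, bytes_per_work_item) <= vram_bytes


# Hypothetical value: replace with what your kernel's buffers actually need.
ASSUMED_BYTES_PER_WORK_ITEM = 4096

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    mib = footprint_bytes(n, ASSUMED_BYTES_PER_WORK_ITEM) / 1024**2
    verdict = "fits" if fits_in_vram(n, ASSUMED_BYTES_PER_WORK_ITEM) else "exceeds VRAM"
    print(f"N = {n:>9}: {mib:10.1f} MiB -> {verdict}")
```

If your real footprint stays far below the capacity even at N = 1000000, VRAM spilling is probably not the cause and the slowdown has to come from somewhere else.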