Tags: gpu, opencl, cpu, gpgpu, pci-e

Do integrated GPUs in CPUs have the overhead of transferring data over the PCIe bus just like transferring data between CPU and dedicated GPU?


Do integrated GPUs in CPUs have the overhead of transferring data over the PCIe bus, just like transferring data between a CPU and a dedicated GPU?

I ask because my OpenCL GPU-accelerated computation performs better on the integrated Intel(R) Iris(R) Xe Graphics GPU than on the dedicated NVIDIA T500 GPU for very large datasets, and I'm trying to understand why. With small datasets, however, the NVIDIA GPU is faster.

The performance tests:

N is also the number of threads!

OpenCL GPU Intel(R) Iris(R) Xe Graphics

PerformanceTestOpenCLAccelerationCalculation, N100

  1. Processing took: 0.0016272 seconds for N = 100
  2. Processing took: 0.0014396 seconds for N = 100
  3. Processing took: 0.0012167 seconds for N = 100
  4. Processing took: 0.0017356 seconds for N = 100
  5. Processing took: 0.0011649 seconds for N = 100

PerformanceTestOpenCLAccelerationCalculation, N1_000

  1. Processing took: 0.0026064 seconds for N = 1000
  2. Processing took: 0.0023835 seconds for N = 1000
  3. Processing took: 0.0024709 seconds for N = 1000
  4. Processing took: 0.002758 seconds for N = 1000
  5. Processing took: 0.0028464 seconds for N = 1000

PerformanceTestOpenCLAccelerationCalculation, N10_000

  1. Processing took: 0.0207773 seconds for N = 10000
  2. Processing took: 0.0205991 seconds for N = 10000
  3. Processing took: 0.0204573 seconds for N = 10000
  4. Processing took: 0.0213179 seconds for N = 10000
  5. Processing took: 0.0213915 seconds for N = 10000

PerformanceTestOpenCLAccelerationCalculation, N100_000

  1. Processing took: 1.2154342 seconds for N = 100000
  2. Processing took: 1.2097847 seconds for N = 100000
  3. Processing took: 1.214667 seconds for N = 100000
  4. Processing took: 1.2179605 seconds for N = 100000
  5. Processing took: 1.2161523 seconds for N = 100000

PerformanceTestOpenCLAccelerationCalculation, N1_000_000

  1. Processing took: 3.0097172 seconds for N = 1000000
  2. Processing took: 7.0101114 seconds for N = 1000000
  3. Processing took: 3.5101502 seconds for N = 1000000
  4. Processing took: 3.0101302 seconds for N = 1000000
  5. Processing took: 7.0098925 seconds for N = 1000000

OpenCL NVIDIA T500 GPU

PerformanceTestOpenCLAccelerationCalculation, N100

  1. Processing took: 0.0008086 seconds for N = 100
  2. Processing took: 0.0007528 seconds for N = 100
  3. Processing took: 0.000913 seconds for N = 100
  4. Processing took: 0.0008781 seconds for N = 100
  5. Processing took: 0.0007748 seconds for N = 100

PerformanceTestOpenCLAccelerationCalculation, N1_000

  1. Processing took: 0.0009754 seconds for N = 1000
  2. Processing took: 0.0009548 seconds for N = 1000
  3. Processing took: 0.0010413 seconds for N = 1000
  4. Processing took: 0.0009378 seconds for N = 1000
  5. Processing took: 0.0009792 seconds for N = 1000

PerformanceTestOpenCLAccelerationCalculation, N10_000

  1. Processing took: 0.0048292 seconds for N = 10000
  2. Processing took: 0.0049848 seconds for N = 10000
  3. Processing took: 0.0048261 seconds for N = 10000
  4. Processing took: 0.0048766 seconds for N = 10000
  5. Processing took: 0.3641353 seconds for N = 100000

PerformanceTestOpenCLAccelerationCalculation, N100_000

  1. Processing took: 0.3648832 seconds for N = 100000
  2. Processing took: 0.364661 seconds for N = 100000
  3. Processing took: 0.3648172 seconds for N = 10000
  4. Processing took: 0.3643978 seconds for N = 100000
  5. Processing took: 0.3636657 seconds for N = 100000

PerformanceTestOpenCLAccelerationCalculation, N1_000_000

  1. Processing took: 36.6879734 seconds for N = 1000000
  2. Processing took: 37.7439221 seconds for N = 1000000
  3. Processing took: 37.7102053 seconds for N = 1000000
  4. Processing took: 37.7474797 seconds for N = 1000000
  5. Processing took: 37.4340912 seconds for N = 1000000

Solution

  • The integrated GPU uses part of the CPU's system memory. Data is not moved over PCIe, which makes host<->device copies much faster. Copies may even be avoided entirely when the iGPU works directly on the host pointer instead of on a separate copy in system memory. If your application relies heavily on host<->device copies, this might explain the performance difference. Try to avoid data movement as much as possible.

    But looking at your performance numbers, this seems unlikely. The T500 does well up to N=100000, but then performance suddenly tanks. To me this indicates that you run out of GPU VRAM (only 2 GB), so the data spills into system memory and is copied over PCIe for every single kernel call, which makes it suddenly very slow. Check VRAM usage.
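A rough way to sanity-check the VRAM hypothesis is to estimate the total buffer footprint for each N and compare it with the 2 GB quoted above. The sketch below is only back-of-the-envelope arithmetic: `ASSUMED_BYTES_PER_WORK_ITEM` and `ASSUMED_PCIE_BYTES_PER_S` are hypothetical placeholders, not values taken from the question. The real footprint is the sum of the sizes passed to `clCreateBuffer`, and the device's actual capacity can be queried with `clGetDeviceInfo` and `CL_DEVICE_GLOBAL_MEM_SIZE`.

```python
# Back-of-the-envelope check: do the buffers for a given N fit in VRAM,
# and how long would one full copy over PCIe take if they had to be moved?

VRAM_BYTES = 2 * 1024**3              # 2 GB, as quoted in the answer
ASSUMED_BYTES_PER_WORK_ITEM = 4096    # hypothetical; replace with your real buffer sizes
ASSUMED_PCIE_BYTES_PER_S = 4e9        # hypothetical effective bandwidth; measure your own


def buffer_bytes(n_elements, bytes_per_element, n_buffers=1):
    """Total bytes needed for n_buffers buffers of n_elements each."""
    return n_elements * bytes_per_element * n_buffers


def fits_in_vram(required_bytes, vram_bytes, reserve_fraction=0.1):
    """True if the buffers fit, keeping a fraction of VRAM reserved
    for the driver, the display, and other processes."""
    usable = vram_bytes * (1.0 - reserve_fraction)
    return required_bytes <= usable


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Idealized time for one copy of n_bytes at the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    for n in (100_000, 1_000_000):
        need = buffer_bytes(n, ASSUMED_BYTES_PER_WORK_ITEM)
        fits = fits_in_vram(need, VRAM_BYTES)
        copy_s = transfer_seconds(need, ASSUMED_PCIE_BYTES_PER_S)
        print(f"N={n}: {need / 1e9:.2f} GB needed, fits in VRAM: {fits}, "
              f"one PCIe copy: {copy_s:.2f} s")
```

With these placeholder numbers the N=100000 case fits and the N=1000000 case does not, which is the pattern the hypothesis predicts. Whether it applies to your kernel depends entirely on your real buffer sizes, so plug those in before drawing conclusions.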