Tags: c#, parallel-processing, cuda, gpu, gpgpu

Can I utilise a GPU to accelerate a non-graphics-related operation in C#, such as a parallel for loop?


I have the following CRC calculation that is executed 12 times in parallel on different data sources.

Can I offload this to the GPU once the CPU thread count is exhausted, or is the GPU not suited to such tasks, so that it makes no sense to do this calculation on the GPU?

If this is the wrong place to ask this question, could you please suggest where it should be asked?

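// 256-entry lookup table of 32-bit values; assumed to be filled in elsewhere (not shown).
private static readonly uint[] _crcLookup = new uint[256];

// Computes the CRC over data[lower..upper], both bounds inclusive.
public static uint CalculateCRC(byte[] data, uint lower, uint upper)
{
    uint crc = uint.MaxValue;
    uint addr = lower;

    while (addr <= upper)
    {
        // Each iteration needs the crc value produced by the previous one.
        crc = (crc >> 8) ^ _crcLookup[(byte)(data[addr] ^ crc)];
        addr++;
    }

    return ~crc;
}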

Parallel implementation

// Filled elsewhere with one (data, lower, upper) entry per data source.
var dataSegments = new ConcurrentBag<(byte[] data, uint lower, uint upper)>();

Parallel.ForEach(dataSegments, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, segment =>
{
    uint result = CalculateCRC(segment.data, segment.lower, segment.upper);
    // Do something with the result...
});

Solution

  • Can I offload this to the GPU once the CPU thread count is exhausted, or is the GPU not suited to such tasks, so that it makes no sense to do this calculation on the GPU?

    Technically yes, but only with great difficulty. You cannot run arbitrary C# code on a GPU, so you would likely need to write the GPU code in some other language (CUDA, for example) and call it from C#, with all the complexity that entails.

    But chances are you would only see a performance reduction. A CRC calculation should be I/O-bound if the code is reasonably well optimized, so the overhead of transferring the data to the GPU would very likely cost more than the GPU could save. Also, GPUs are designed for massive parallelism. The while (addr <= upper) loop has a dependency between iterations (each step needs the crc value from the previous step), so it cannot be parallelized directly. It might be possible to use some hierarchical version of CRC, where chunks are processed independently and the partial results are combined, but even then the overhead would be prohibitive.

    Parallelizing across the 12 data sources should be done with CPU threads, not on the GPU. A single CPU core is much faster than a single "GPU core". A GPU gets its throughput from having possibly thousands of cores, whereas a CPU has a single- or double-digit number of them, and 12 independent tasks come nowhere near keeping a GPU busy.
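
To make the two points above concrete, here is a minimal sketch rather than a definitive implementation. It assumes the standard reflected CRC-32 polynomial (0xEDB88320), since the question does not show how its table is filled. The names CrcSketch, CalculateAll and Combine are made up for illustration. CalculateAll runs one CRC per segment on CPU threads, and Combine shows the idea behind a "hierarchical" CRC: two chunks are processed independently and their results merged. This Combine is deliberately naive and walks over the second chunk's length one byte at a time; production implementations such as zlib's crc32_combine do that step in logarithmic time.

```csharp
using System;
using System.Text;
using System.Threading.Tasks;

public static class CrcSketch
{
    // Assumption: standard reflected CRC-32 polynomial. The question does not
    // show its table contents, so this table is illustrative only.
    private const uint Polynomial = 0xEDB88320u;
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint value = i;
            for (int bit = 0; bit < 8; bit++)
                value = (value & 1) != 0 ? (value >> 1) ^ Polynomial : value >> 1;
            table[i] = value;
        }
        return table;
    }

    // Same loop shape as CalculateCRC in the question; both bounds are inclusive.
    public static uint Calculate(byte[] data, int lower, int upper)
    {
        uint crc = uint.MaxValue;
        for (int addr = lower; addr <= upper; addr++)
            crc = (crc >> 8) ^ Table[(byte)(data[addr] ^ crc)];
        return ~crc;
    }

    // One CRC per segment, spread over CPU threads; results keep the input order.
    public static uint[] CalculateAll((byte[] Data, int Lower, int Upper)[] segments)
    {
        var results = new uint[segments.Length];
        Parallel.For(0, segments.Length, i =>
        {
            var (data, lower, upper) = segments[i];
            results[i] = Calculate(data, lower, upper);
        });
        return results;
    }

    // Merges CRC(A) and CRC(B) into CRC(A followed by B).
    // Naive: pushes crcA through lengthB zero bytes, so it costs O(lengthB).
    public static uint Combine(uint crcA, uint crcB, int lengthB)
    {
        uint shifted = crcA;
        for (int i = 0; i < lengthB; i++)
            shifted = (shifted >> 8) ^ Table[(byte)shifted];
        return shifted ^ crcB;
    }

    public static void Main()
    {
        byte[] sample = Encoding.ASCII.GetBytes("123456789");

        // Whole buffer in one sequential pass.
        uint whole = Calculate(sample, 0, sample.Length - 1);

        // "1234" and "56789" computed independently, then merged.
        uint[] parts = CalculateAll(new[] { (sample, 0, 3), (sample, 4, 8) });
        uint combined = Combine(parts[0], parts[1], 5);

        Console.WriteLine($"{whole:X8} {combined:X8}"); // prints "CBF43926 CBF43926"
    }
}
```

Even with the merge step available, every input byte still has to go through the table loop once, so splitting one buffer only pays off when there are idle cores and the data is already in memory.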