c#opengl gpgpu point-clouds compute-shader

How to mitigate a CPU-to-GPU transfer bottleneck

I am using the ComputeSharp library to run a compute shader on a very large dataset. The dataset is around 10 GB, split into smaller pieces (about 3 GB each) that the GPU can handle. The problem is that each piece takes about 1 s to load, compute, and return, even though the computation itself is almost instant.

I am looking for a way to speed this up, because at the moment the CPU outperforms the GPU in certain cases.

More details: The dataset consists of custom points forming a point cloud. The shader finds the points with the highest values and uses them to render an image. The maximum size of the point cloud will be about 500 million points.

The points are already as small as they can be, with all the metadata packed into a single int. Everything is put into a buffer and passed to the shader, which writes its result to another buffer. I already tried to use textures instead, but that failed because they do not support custom types.

Edit (Minimal reproduction):

using System.Diagnostics;
using ComputeSharp;

public struct DataPoint
{
    public float3 Position;
    public uint Value;
}

// _gpu and resultBuffer are fields defined elsewhere (not shown).
public void ComputeOneChunk(DataPoint[] dataPoints)
{
    // Time the upload of the chunk from the managed array to a GPU buffer.
    var stopWatch = Stopwatch.StartNew();
    using var currentChunk = _gpu.AllocateReadOnlyBuffer(dataPoints);
    stopWatch.Stop();
    Debug.WriteLine($"Buffering took {stopWatch.ElapsedMilliseconds}ms");

    // Time the shader dispatch over every point in the chunk.
    stopWatch.Restart();
    _gpu.For(dataPoints.Length, 1, new FindMaxForWell(
        currentChunk,
        resultBuffer));
    stopWatch.Stop();
    Debug.WriteLine($"Execution took {stopWatch.ElapsedMilliseconds}ms");
}

[AutoConstructor]
public readonly partial struct FindMaxForWell : IComputeShader
{
    public readonly ReadOnlyBuffer<DataPoint> buffer;
    public readonly ReadWriteBuffer<uint> resultBuffer;

    public void Execute()
    {
        //DoStuff
    }
}

Solution

Found a solution. It seems that converting the dataPoints array to a buffer takes a long time and involves some intermediate steps. To speed this up, you can use an UploadBuffer, which eliminates these extra steps.

// uploadBuffer is an UploadBuffer<DataPoint> that already holds the chunk.
_currentChunk = _gpu.AllocateReadOnlyBuffer<DataPoint>(uploadBuffer.Length);
_currentChunk.CopyFrom(uploadBuffer);

This cut the time per chunk from about 1 s to around 300 ms, which is probably close to as fast as it will go over PCIe.
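
As a rough sanity check on the PCIe remark, the effective transfer rate can be worked out from the numbers above. The snippet below is only back-of-the-envelope arithmetic: the class and method names are made up for illustration, "about 3 GB" is taken as exactly 3,000,000,000 bytes, and the 1 s figure also included compute and read-back, so the "before" rate is a lower bound.

```csharp
using System;

public static class TransferEstimate
{
    // Effective throughput in GB/s (1 GB = 1e9 bytes) for moving
    // `bytes` bytes in `milliseconds` milliseconds.
    public static double ThroughputGBps(long bytes, double milliseconds)
        => bytes / 1e9 / (milliseconds / 1000.0);

    public static void Main()
    {
        // Assumption: one chunk is exactly 3e9 bytes ("about 3 GB" in the post).
        const long chunkBytes = 3_000_000_000;

        // Timings quoted in the post: about 1 s before, about 300 ms after.
        Console.WriteLine($"Before: {ThroughputGBps(chunkBytes, 1000):F1} GB/s");
        Console.WriteLine($"After:  {ThroughputGBps(chunkBytes, 300):F1} GB/s");
    }
}
```

That works out to roughly 3 GB/s before and 10 GB/s after. For comparison, a PCIe 3.0 x16 link has a theoretical maximum of about 15.75 GB/s, so 10 GB/s is plausibly near the practical ceiling if the card sits on such a link. On a PCIe 4.0 x16 link (about 31.5 GB/s theoretical) there may still be headroom.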

However, another problem appeared: UploadBuffer exposes its memory as a Span, and writing the data into it is very slow. I am still trying to solve that.
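
One thing that may be worth trying, assuming the points already sit in a managed array, is to fill the Span with a single bulk copy instead of writing element by element. The sketch below is not ComputeSharp-specific: it uses a plain array as a stand-in for the upload buffer's Span, replaces float3 with three floats so it has no external dependency, and ChunkFiller and its methods are hypothetical names. Whether this removes the slowdown depends on where the time actually goes. It is also worth confirming that the measurement comes from a Release build, since Span indexing tends to be much slower without JIT optimizations.

```csharp
using System;

// Same 16-byte layout as DataPoint in the question, with float3 replaced
// by three floats so the sketch compiles without ComputeSharp.
public struct DataPoint
{
    public float X, Y, Z;
    public uint Value;
}

public static class ChunkFiller
{
    // Per-element pattern: one indexer call (with bounds check) per point.
    public static void FillPerElement(DataPoint[] source, Span<DataPoint> destination)
    {
        for (int i = 0; i < source.Length; i++)
        {
            destination[i] = source[i];
        }
    }

    // Bulk pattern: a single block copy of the whole chunk.
    public static void FillBulk(ReadOnlySpan<DataPoint> source, Span<DataPoint> destination)
        => source.CopyTo(destination);

    public static void Main()
    {
        var points = new DataPoint[1_000_000];
        for (int i = 0; i < points.Length; i++)
        {
            points[i].Value = (uint)i;
        }

        // Stand-in for the upload buffer's Span; in the real code this
        // would be the Span exposed by the UploadBuffer.
        Span<DataPoint> destination = new DataPoint[points.Length];

        FillBulk(points, destination);
        Console.WriteLine(destination[destination.Length - 1].Value); // prints 999999
    }
}
```

If the points are produced on the fly rather than copied from an existing array, writing them straight into the Span in a tight loop avoids the intermediate array altogether, though that only pays off if the per-element writes themselves are fast.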