c++ opengl directx directx-11

Compute Shader execution time between DirectX11 and OpenGL


I am studying compute shaders in DirectX and OpenGL.

I wrote some code to test a compute shader and measured its execution time,

but there was a noticeable difference between the DirectX execution time and OpenGL's.

[screenshot: measured execution times]

The image above shows how big the difference is (left is DirectX, right is OpenGL; the times are in nanoseconds).

The DirectX compute shader is even slower than the CPU.

Here is my code that computes the sum of the two vectors, once with the compute shader and once on the CPU:

        std::vector<Data> dataA(32);
        std::vector<Data> dataB(32);

        for (int i = 0; i < 32; ++i)
        {
            dataA[i].v1 = glm::vec3(i, i, i);
            dataA[i].v2 = glm::vec2(i, 0);

            dataB[i].v1 = glm::vec3(-i, i, 0.0f);
            dataB[i].v2 = glm::vec2(0, -i);
        }

        InputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataA.data());
        InputBufferB = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataB.data());
        OutputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::ReadWrite);

        computeShader->Bind();
        InputBufferA->Bind(0, ShaderType::CS);
        InputBufferB->Bind(1, ShaderType::CS);
        OutputBufferA->Bind(0, ShaderType::CS);

        // Check The Compute Shader Calculation time
        std::chrono::system_clock::time_point time1 = std::chrono::system_clock::now();
        RenderCommand::DispatchCompute(1, 1, 1);
        std::chrono::system_clock::time_point time2 = std::chrono::system_clock::now();
        std::chrono::nanoseconds t = time2 - time1;
        QCAT_CORE_INFO("Compute Shader time : {0}", t.count());
        
        // Check The Cpu Calculation time
        std::vector<Data> dataC(32);
        time1 = std::chrono::system_clock::now();
        for (int i = 0; i < 32; ++i)
        {
            dataC[i].v1 = (dataA[i].v1 + dataB[i].v1);
            dataC[i].v2 = (dataA[i].v2 + dataB[i].v2);
        }
        time2 = std::chrono::system_clock::now();
        t = time2 - time1;
        QCAT_CORE_INFO("CPU time : {0}", t.count() );

And here is the GLSL code:

#version 450 core
struct Data
{
    vec3 a;
    vec2 b;
};
layout(std430,binding =0) readonly buffer Data1
{
    Data input1[];
};

layout(std430,binding =1) readonly buffer Data2
{
    Data input2[];
};

layout(std430,binding =2) writeonly buffer Data3
{
    Data outputData[];
};

layout (local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

void main()
{
  uint index = gl_GlobalInvocationID.x;

  outputData[index].a = input1[index].a + input2[index].a;
  outputData[index].b = input1[index].b + input2[index].b;
}

And the HLSL code:


struct Data
{
    float3 v1;
    float2 v2;
};
StructuredBuffer<Data> gInputA : register(t0);
StructuredBuffer<Data> gInputB : register(t1);
RWStructuredBuffer<Data> gOutput : register(u0);

[numthreads(32,1,1)]
void CSMain(int3  dtid : SV_DispatchThreadID)
{
    gOutput[dtid.x].v1 = gInputA[dtid.x].v1 + gInputB[dtid.x].v1;
    gOutput[dtid.x].v2 = gInputA[dtid.x].v2 + gInputB[dtid.x].v2;
}

Pretty simple code, isn't it?

But OpenGL's time is about 10 times better than DirectX's.

I don't get why this happens. Is there anything that slows down the performance?

This is the code I use when I create the RWStructuredBuffer; the only difference for the StructuredBuffer is that its BindFlags is D3D11_BIND_SHADER_RESOURCE:

        desc.Usage = D3D11_USAGE_DEFAULT;
        desc.ByteWidth = size * count;
        desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
        desc.CPUAccessFlags = 0;
        desc.StructureByteStride = size;
        desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;

        D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
        uavDesc.Format = DXGI_FORMAT_UNKNOWN;
        uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
        uavDesc.Buffer.FirstElement = 0;
        uavDesc.Buffer.Flags = 0;
        uavDesc.Buffer.NumElements = count;

And in OpenGL I create the SSBO like this:

    glGenBuffers(1, &m_renderID);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_renderID);
    glBufferData(GL_SHADER_STORAGE_BUFFER, int(size * count), pData, GL_STATIC_DRAW);

This is all the code used to execute the compute shader in both APIs,

and every result shows that OpenGL is faster than DirectX.

What makes that difference?

Is it in the buffers or the shader code?


Solution

  • So first, as mentioned in the comments, you are not measuring GPU execution time, but the time it takes to record the command itself (the GPU will execute it later, at some point when it decides to flush the commands).

    In order to measure GPU execution time, you need to use Queries

    In your case (Direct3D11, but similar for OpenGL), you need to create 3 queries (a creation sketch follows this list):

    • 2 must be of type D3D11_QUERY_TIMESTAMP (to measure start and end time)
    • 1 must be of type D3D11_QUERY_TIMESTAMP_DISJOINT (the disjoint query indicates that the timestamp results are no longer valid, for example if the clock frequency of your GPU changed). The disjoint query also gives you the frequency, which is needed to convert ticks to milliseconds.
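
    A minimal sketch of creating those queries, assuming a valid ID3D11Device* named d3d11Device (the query variable names are placeholders matching the snippet below):

    ID3D11Query* yourFirstTimeStampQuery  = nullptr;
    ID3D11Query* yourSecondTimeStampQuery = nullptr;
    ID3D11Query* yourDisjointQuery        = nullptr;

    D3D11_QUERY_DESC queryDesc = {};
    queryDesc.Query = D3D11_QUERY_TIMESTAMP;             // start/end timestamps
    d3d11Device->CreateQuery(&queryDesc, &yourFirstTimeStampQuery);
    d3d11Device->CreateQuery(&queryDesc, &yourSecondTimeStampQuery);

    queryDesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;    // validity flag + clock frequency
    d3d11Device->CreateQuery(&queryDesc, &yourDisjointQuery);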

    So to measure your GPU time, you issue the following on the device context:

     d3d11DeviceContext->Begin(yourDisjointQuery);
     d3d11DeviceContext->End(yourFirstTimeStampQuery);

     // Dispatch call goes here

     d3d11DeviceContext->End(yourSecondTimeStampQuery);
     d3d11DeviceContext->End(yourDisjointQuery);


    Note that the timestamp queries only ever call End (there is no Begin for them), which is perfectly normal: you are simply asking for the "GPU clock", to simplify.

    Then you can call (the order does not matter):

    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjoint = {};
    UINT64 start = 0, end = 0;
    // GetData returns S_FALSE until the results are ready, so poll (or read them a frame later).
    d3d11DeviceContext->GetData(yourDisjointQuery, &disjoint, sizeof(disjoint), 0);
    d3d11DeviceContext->GetData(yourSecondTimeStampQuery, &end, sizeof(end), 0);
    d3d11DeviceContext->GetData(yourFirstTimeStampQuery, &start, sizeof(start), 0);
    

    Check that the disjoint result is NOT disjoint (disjoint.Disjoint == FALSE), and get the frequency from it:

    double delta = double(end - start);
    double frequency = double(disjoint.Frequency);
    double milliseconds = (delta / frequency) * 1000.0;
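
    The OpenGL side is similar; here is a rough sketch using GL timer queries (this assumes a GL 4.3+ context for glDispatchCompute and is not part of the original answer's code):

    GLuint queries[2];
    glGenQueries(2, queries);

    glQueryCounter(queries[0], GL_TIMESTAMP);   // record "start" on the GPU timeline
    glDispatchCompute(1, 1, 1);
    glQueryCounter(queries[1], GL_TIMESTAMP);   // record "end" on the GPU timeline

    GLuint64 gpuStart = 0, gpuEnd = 0;
    // GL_QUERY_RESULT blocks until the results are available.
    glGetQueryObjectui64v(queries[0], GL_QUERY_RESULT, &gpuStart);
    glGetQueryObjectui64v(queries[1], GL_QUERY_RESULT, &gpuEnd);
    GLuint64 gpuTimeNs = gpuEnd - gpuStart;     // timestamps are already in nanoseconds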
    

    So now, why does "just" recording that command take more time than doing the same calculation on the CPU?

    You only perform a few additions on 32 elements, which is an extremely trivial and fast operation for a CPU.

    If you start to increase the element count, the GPU will eventually take over.
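
    For example, a minimal sketch of scaling the test, reusing the question's own ShaderBuffer / RenderCommand wrappers (assumed to forward to the underlying API) so that the number of thread groups matches the larger buffers:

    const UINT elementCount = 1 << 20;                           // ~1M elements instead of 32
    const UINT groupSize    = 32;                                // must match [numthreads(32,1,1)] / local_size_x
    const UINT groupCount   = (elementCount + groupSize - 1) / groupSize;

    std::vector<Data> dataA(elementCount), dataB(elementCount);  // fill as before
    InputBufferA  = ShaderBuffer::Create(sizeof(Data), elementCount, BufferType::Read, dataA.data());
    InputBufferB  = ShaderBuffer::Create(sizeof(Data), elementCount, BufferType::Read, dataB.data());
    OutputBufferA = ShaderBuffer::Create(sizeof(Data), elementCount, BufferType::ReadWrite);

    RenderCommand::DispatchCompute(groupCount, 1, 1);            // one group per 32 elements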

    First, if you created your D3D device with the DEBUG flag, remove that flag when profiling. With some drivers (NVIDIA in particular), command recording performs very poorly with that flag.
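
    For instance, a sketch of gating the debug layer behind a build flag when creating the device (the PROFILING macro is just an illustrative name, not an SDK define):

    UINT flags = 0;
    #if defined(_DEBUG) && !defined(PROFILING)
    flags |= D3D11_CREATE_DEVICE_DEBUG;   // keep the debug layer out of profiling runs
    #endif

    ID3D11Device* device = nullptr;
    ID3D11DeviceContext* context = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
                      nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &context);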

    Second, the driver performs quite a bunch of checks when you call Dispatch (that resources have the correct format, correct strides, are still alive, ...). The DirectX driver tends to do a lot of checks, so it might be slightly slower than the GL one (but not by that magnitude, which leads to the last point).

    Last, it is likely that the GPU/driver does a warm-up on your shader (some drivers convert the DX bytecode to their native counterpart asynchronously), so when you call

    device->CreateComputeShader();
    

    it might be done immediately or placed in a queue (AMD does the queue thing, see this link: GPU Open Shader Compiler controls). If you call Dispatch before this task has effectively been processed, you might have to wait as well.

    Also note that most GPUs have a shader cache on disk nowadays, so the first compile/use might also impact performance.

    So you should try to call Dispatch several times, and check whether the CPU timings are different after the first call.
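
    A rough sketch of that check, reusing the question's RenderCommand and QCAT_CORE_INFO helpers (assumed to behave as in the posted code); the first iteration will typically be the slow one:

    for (int i = 0; i < 5; ++i)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        RenderCommand::DispatchCompute(1, 1, 1);
        auto t1 = std::chrono::high_resolution_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
        QCAT_CORE_INFO("Dispatch {0} record time : {1}", i, ns.count());
    }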