I'm exploring using a compute shader to apply bone deformation to mesh vertices rather than a vertex shader with stream output. I've found the compute shader executes far slower than the vertex shader but before I write it off, I want to be sure I'm not doing something wrong.
With my test data of 100,000 vertices and 1,000 frames of animation data for 300 bones, the vertex shader runs in around 0.22ms while the compute shader takes nearly 4x as long at 0.85ms. The timing is done via D3D API timer queries (rather than a CPU timer).
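For reference, the timing follows the usual D3D11 timestamp-query pattern, something along these lines (a sketch with illustrative names, not my exact code):

// Create the three queries once (a disjoint query brackets two timestamps).
ID3D11Query *disjoint_query = nullptr, *ts_begin = nullptr, *ts_end = nullptr;
D3D11_QUERY_DESC qd = {};
qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
device->CreateQuery(&qd, &disjoint_query);
qd.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&qd, &ts_begin);
device->CreateQuery(&qd, &ts_end);

// Bracket the work being measured.
context->Begin(disjoint_query);
context->End(ts_begin);                  // record "before" timestamp
// ... the Dispatch() or Draw() under test goes here ...
context->End(ts_end);                    // record "after" timestamp
context->End(disjoint_query);

// Poll for the results (fine in a profiling harness).
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
while (context->GetData(disjoint_query, &dj, sizeof(dj), 0) != S_OK) {}
UINT64 t0 = 0, t1 = 0;
while (context->GetData(ts_begin, &t0, sizeof(t0), 0) != S_OK) {}
while (context->GetData(ts_end, &t1, sizeof(t1), 0) != S_OK) {}
if (!dj.Disjoint) {
    double ms = double(t1 - t0) * 1000.0 / double(dj.Frequency);
}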
deform_structs.hlsl
struct Vertex {
float3 position : POSITION;
float3 normal : NORMAL;
float2 texcoord : TEXCOORD;
float3 tangent : TANGENT;
float4 color : COLOR;
};
struct BoneWeights {
uint index;
float weight;
};
StructuredBuffer<matrix> g_bone_array : register(t0);
Buffer<uint> g_bone_offsets : register(t1);
Buffer<uint> g_bone_counts : register(t2);
StructuredBuffer<BoneWeights> g_bone_weights : register(t3);
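For completeness, the StructuredBuffer SRVs above require the buffer to be created with the structured-buffer misc flag. Roughly like this for g_bone_array (bone_count, initial_data, and bone_buffer are placeholder names, not from my actual code):

// Structured buffer holding the bone matrices (element = float4x4, 64 bytes).
D3D11_BUFFER_DESC bd = {};
bd.ByteWidth           = bone_count * 64;
bd.Usage               = D3D11_USAGE_DEFAULT;
bd.BindFlags           = D3D11_BIND_SHADER_RESOURCE;
bd.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bd.StructureByteStride = 64;
device->CreateBuffer(&bd, &initial_data, &bone_buffer);

// SRV over the whole buffer; structured buffers use DXGI_FORMAT_UNKNOWN.
D3D11_SHADER_RESOURCE_VIEW_DESC sd = {};
sd.Format              = DXGI_FORMAT_UNKNOWN;
sd.ViewDimension       = D3D11_SRV_DIMENSION_BUFFER;
sd.Buffer.FirstElement = 0;
sd.Buffer.NumElements  = bone_count;
device->CreateShaderResourceView(bone_buffer, &sd, &bone_array_srv);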
bone_deform_cs.hlsl
#include "deform_structs.hlsl"
StructuredBuffer<Vertex> g_input_vertex : register(t4);
RWStructuredBuffer<Vertex> g_output_vertex : register(u0);
[numthreads(64,1,1)]
void BoneDeformCS(uint3 id : SV_DispatchThreadID) {
Vertex vert = g_input_vertex[id.x];
uint offset = g_bone_offsets[id.x];
uint count = g_bone_counts[id.x];
matrix bone_matrix = 0;
for (uint i = offset; i < (offset + count); ++i) {
BoneWeights weight_info = g_bone_weights[i];
bone_matrix += weight_info.weight * g_bone_array[weight_info.index];
}
vert.position = mul(float4(vert.position,1), bone_matrix).xyz;
vert.normal = normalize(mul(vert.normal, (float3x3)bone_matrix));
vert.tangent = normalize(mul(vert.tangent, (float3x3)bone_matrix));
g_output_vertex[id.x] = vert;
}
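Note that the rounded-up dispatch described below launches 100,032 threads for my 100,000 vertices, so the last 32 threads index past the end of the buffers. D3D11 discards out-of-bounds UAV writes, so the results stay correct, but an explicit guard would be cleaner. A sketch, assuming the vertex count is supplied in a constant buffer (which the shader above doesn't have):

cbuffer DeformParams : register(b0) {   // hypothetical; not in the shader above
    uint g_vertex_count;
};

[numthreads(64,1,1)]
void BoneDeformCS(uint3 id : SV_DispatchThreadID) {
    if (id.x >= g_vertex_count)
        return;                          // skip the padding threads in the last group
    // ... deformation code as above ...
}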
bone_deform_vs.hlsl
#include "deform_structs.hlsl"
void BoneDeformVS(uint id : SV_VertexID, Vertex vsin, out Vertex vsout) {
uint offset = g_bone_offsets[id];
uint count = g_bone_counts[id];
matrix bone_matrix = 0;
for (uint i = offset; i < (offset + count); ++i) {
BoneWeights bone_info = g_bone_weights[i];
bone_matrix += bone_info.weight * g_bone_array[bone_info.index];
}
vsout.position = mul(float4(vsin.position,1), bone_matrix).xyz;
vsout.normal = normalize(mul(vsin.normal, (float3x3)bone_matrix));
vsout.tangent = normalize(mul(vsin.tangent, (float3x3)bone_matrix));
vsout.texcoord = vsin.texcoord;
vsout.color = vsin.color;
}
Comparing the contents of the buffers once they've run, they are identical and contain the expected values.
I suspect that maybe I'm executing the compute shader incorrectly, spawning too many threads? Do I have the number I pass to Dispatch wrong? Since it is a 1-dimensional row of data, it made sense to me to use [numthreads(64,1,1)]. I've tried various values from 32 to 1024; 64 seems to be the sweet spot, as it's the minimum needed for efficient use of AMD GPUs. Anyway, when I call Dispatch, I ask it to execute vertex_count / 64 groups, rounded up by one extra group when vertex_count isn't a multiple of 64. For 100,000 vertices, the call ends up being Dispatch(1563,1,1).
ID3D11ShaderResourceView * srvs[] = {bone_array_srv, bone_offset_srv,
                                     bone_count_srv, bone_weights_srv,
                                     cs_vertices_srv};
ID3D11UnorderedAccessView * uavs[] = {cs_output_uav};
UINT srv_count = sizeof(srvs) / sizeof(srvs[0]);
UINT uav_count = sizeof(uavs) / sizeof(uavs[0]);
// Round up so a partial final group covers the remaining vertices. The
// parentheses matter: ?: binds looser than +, so without them the whole
// expression collapses to 1.
UINT thread_group_count = vertex_count / 64 + ((vertex_count % 64 != 0) ? 1 : 0);
context->CSSetShader(cs, nullptr, 0);
context->CSSetShaderResources(0, srv_count, srvs);
context->CSSetUnorderedAccessViews(0, uav_count, uavs, nullptr);  // no append counters
context->Dispatch(thread_group_count, 1, 1);
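After the dispatch, the output UAV also has to be unbound before the deformed buffer is read elsewhere, since D3D11 won't leave a resource simultaneously bound for writing and reading:

// Unbind the output UAV so the buffer can later be bound as an SRV or vertex buffer.
ID3D11UnorderedAccessView * null_uav[] = { nullptr };
context->CSSetUnorderedAccessViews(0, 1, null_uav, nullptr);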
And this is how the vertex shader is executed:
ID3D11ShaderResourceView * srvs[] = {bone_array_srv, bone_offset_srv,
bone_count_srv, bone_weights_srv};
UINT srv_count = sizeof(srvs) / sizeof(srvs[0]);
UINT stride = sizeof(Vertex);  // byte size of one vertex (assumes a matching C++ Vertex struct); a stride of 0 would fetch the same vertex for every point
UINT offset = 0;
context->GSSetShader(streamout_gs, nullptr, 0);
context->VSSetShader(vs, nullptr, 0);
context->VSSetShaderResources(0, srv_count, srvs);
context->SOSetTargets(1, &vs_output_buf, &offset);
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_POINTLIST);
context->IASetInputLayout(vs_input_layout);
context->IASetVertexBuffers(0, 1, &vs_vertices, &stride, &offset);
context->Draw(vertex_count, 0);
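For completeness, streamout_gs is a stream-output geometry shader created from the vertex shader bytecode, with a declaration matching the Vertex layout. Roughly like this (vs_bytecode and vs_bytecode_size are placeholder names for the compiled BoneDeformVS blob):

// One D3D11_SO_DECLARATION_ENTRY per output field of Vertex:
// {stream, semantic, semantic index, start component, component count, output slot}
D3D11_SO_DECLARATION_ENTRY so_decl[] = {
    {0, "POSITION", 0, 0, 3, 0},
    {0, "NORMAL",   0, 0, 3, 0},
    {0, "TEXCOORD", 0, 0, 2, 0},
    {0, "TANGENT",  0, 0, 3, 0},
    {0, "COLOR",    0, 0, 4, 0},
};
UINT so_stride = (3 + 3 + 2 + 3 + 4) * sizeof(float);  // one output buffer
device->CreateGeometryShaderWithStreamOutput(
    vs_bytecode, vs_bytecode_size,
    so_decl, ARRAYSIZE(so_decl),
    &so_stride, 1,
    D3D11_SO_NO_RASTERIZED_STREAM,   // nothing is rasterized
    nullptr, &streamout_gs);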
Or is the answer simply that reading from a shader resource view and writing to an unordered access view is far slower than reading from a vertex buffer and writing to a stream-output buffer?
I'm just learning how to work with compute shaders, so I'm not an expert, but regarding your bone calculation I'm sure the CS should run at least as fast as the VS. Intuition tells me that numthreads(64,1,1) is less efficient than something like numthreads(16,16,1).
So you could give this approach a try:

size = ceil(sqrt(numvertices))
Dispatch(size / 16, size / 16, 1)

in your program, and numthreads(16,16,1) in your hlsl file. Pass the size and numvertices values to the shader (e.g. in a constant buffer). Instead of using id.x as the index, you calculate your own (linear) index as int index = id.y * size + id.x (maybe id.xy is also possible as an index). In most cases size * size will be greater than numvertices, so you'll end up with more threads than vertices. You can block these extra threads by adding a condition in your hlsl function:

int index = id.y * size + id.x;
if (index < numvertices) { .. // your code follows
I hope that this approach speeds up your CS calculations.
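Putting this together, a minimal hlsl sketch of the 2D layout (size and numvertices passed in a constant buffer; the names are just examples):

cbuffer Params : register(b0) {
    uint size;          // ceil(sqrt(numvertices)), set by the host
    uint numvertices;
};

[numthreads(16,16,1)]
void BoneDeformCS(uint3 id : SV_DispatchThreadID) {
    uint index = id.y * size + id.x;    // rebuild a linear vertex index
    if (index >= numvertices)
        return;                          // block the surplus threads
    // ... same deformation code as before, indexing with index instead of id.x ...
}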
================ EDIT ==================
My suggestion was based on my own timing tests. In order to verify my case I repeated these tests with more variations of the numthreads parameters. I calculate the Mandelbrot set over 1034 x 827 = 855,118 pixels. Here are the results:
numthreads        Dispatch          groups   threads/   total
  x    y   fps      x     y                  group      threads
  4    4   240     259   207        53445      16       855118
  8    8   550     129   103        13361      64       855118
 16   16   600      65    52         3340     256       855118
 32   32   580      32    26          835    1024       855118
 64    1   550      16   827        13361      64       855118
256    1   460       4   827         3340     256       855118
512    1   370       2   827         1670     512       855118
As you can see, the sweet spot, numthreads(16,16,1), creates the same number of thread groups (3340) as numthreads(256,1,1), but the performance is 30% better (600 vs. 460 fps). Please note that the total thread count is (and must be) always the same! My GPU is an ATI 7790.
================ EDIT 2 ==================
In order to investigate your question about CS vs. VS speed more deeply, I have reviewed a very interesting Channel 9 video (a PDC09 presentation by Microsoft chief architect Chas Boyd about DirectCompute; see the link below). In this presentation Boyd states that optimizing the thread layout (numthreads) can lead to a twofold increase in throughput.
More interesting, however, is the part of his presentation (starting at minute 40) where he explains the correlation between UAVs and GPU memory layout ("Graphics vs. Compute I/O"). I don't want to draw wrong conclusions from Boyd's statements, but it seems at least possible that compute shaders bound via UAVs have lower memory bandwidth than other GPU shaders. If this were true, it might also explain why UAVs can't be bound to the VS stage, for example (at least in Direct3D 11.0).
Since these memory access patterns also depend on hardware design, you should escalate your question directly to ATI / NVIDIA engineers.
CONCLUSION
I have absorbed tons of information about CS usage, but there was not the slightest indication that a CS could run the same algorithm slower than a VS. If that is really the case, you have detected something that matters for everyone who uses DirectCompute.