I'm exploring using a compute shader to apply bone deformation to mesh vertices rather than a vertex shader with stream output. I've found the compute shader executes far slower than the vertex shader but before I write it off, I want to be sure I'm not doing something wrong.
With my test data of 100,000 vertices and 1,000 frames of animation data for 300 bones, the vertex shader runs in around 0.22ms while the compute shader takes nearly 4x as long at 0.85ms. The timing is done via D3D API timer queries (rather than a CPU timer).
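For reference, the timing follows the usual D3D11 timestamp-query pattern, something along these lines (a sketch with illustrative names, not my exact code):

// Create the three queries once (a disjoint query brackets two timestamps).
ID3D11Query *disjoint_query = nullptr, *ts_begin = nullptr, *ts_end = nullptr;
D3D11_QUERY_DESC qd = {};
qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
device->CreateQuery(&qd, &disjoint_query);
qd.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&qd, &ts_begin);
device->CreateQuery(&qd, &ts_end);

// Bracket the work being measured.
context->Begin(disjoint_query);
context->End(ts_begin);                  // record "before" timestamp
// ... the Dispatch() or Draw() under test goes here ...
context->End(ts_end);                    // record "after" timestamp
context->End(disjoint_query);

// Poll for the results (fine in a profiling harness).
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
while (context->GetData(disjoint_query, &dj, sizeof(dj), 0) != S_OK) {}
UINT64 t0 = 0, t1 = 0;
while (context->GetData(ts_begin, &t0, sizeof(t0), 0) != S_OK) {}
while (context->GetData(ts_end, &t1, sizeof(t1), 0) != S_OK) {}
if (!dj.Disjoint) {
    double ms = double(t1 - t0) * 1000.0 / double(dj.Frequency);
}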
deform_structs.hlsl
struct Vertex {
float3 position : POSITION;
float3 normal : NORMAL;
float2 texcoord : TEXCOORD;
float3 tangent : TANGENT;
float4 color : COLOR;
};
struct BoneWeights {
uint index;
float weight;
};
StructuredBuffer<matrix> g_bone_array : register(t0);
Buffer<uint> g_bone_offsets : register(t1);
Buffer<uint> g_bone_counts : register(t2);
StructuredBuffer<BoneWeights> g_bone_weights : register(t3);
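For completeness, the StructuredBuffer SRVs above require the buffer to be created with the structured-buffer misc flag. Roughly like this for g_bone_array (bone_count, initial_data, and bone_buffer are placeholder names, not from my actual code):

// Structured buffer holding the bone matrices (element = float4x4, 64 bytes).
D3D11_BUFFER_DESC bd = {};
bd.ByteWidth           = bone_count * 64;
bd.Usage               = D3D11_USAGE_DEFAULT;
bd.BindFlags           = D3D11_BIND_SHADER_RESOURCE;
bd.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bd.StructureByteStride = 64;
device->CreateBuffer(&bd, &initial_data, &bone_buffer);

// SRV over the whole buffer; structured buffers use DXGI_FORMAT_UNKNOWN.
D3D11_SHADER_RESOURCE_VIEW_DESC sd = {};
sd.Format              = DXGI_FORMAT_UNKNOWN;
sd.ViewDimension       = D3D11_SRV_DIMENSION_BUFFER;
sd.Buffer.FirstElement = 0;
sd.Buffer.NumElements  = bone_count;
device->CreateShaderResourceView(bone_buffer, &sd, &bone_array_srv);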
bone_deform_cs.hlsl
#include "deform_structs.hlsl"
StructuredBuffer<Vertex> g_input_vertex : register(t4);
RWStructuredBuffer<Vertex> g_output_vertex : register(u0);
[numthreads(64,1,1)]
void BoneDeformCS(uint3 id : SV_DispatchThreadID) {
Vertex vert = g_input_vertex[id.x];
uint offset = g_bone_offsets[id.x];
uint count = g_bone_counts[id.x];
matrix bone_matrix = 0;
for (uint i = offset; i < (offset + count); ++i) {
BoneWeights weight_info = g_bone_weights[i];
bone_matrix += weight_info.weight * g_bone_array[weight_info.index];
}
vert.position = mul(float4(vert.position,1), bone_matrix).xyz;
vert.normal = normalize(mul(vert.normal, (float3x3)bone_matrix));
vert.tangent = normalize(mul(vert.tangent, (float3x3)bone_matrix));
g_output_vertex[id.x] = vert;
}
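Note that the rounded-up dispatch described below launches 100,032 threads for my 100,000 vertices, so the last 32 threads index past the end of the buffers. D3D11 discards out-of-bounds UAV writes, so the results stay correct, but an explicit guard would be cleaner. A sketch, assuming the vertex count is supplied in a constant buffer (which the shader above doesn't have):

cbuffer DeformParams : register(b0) {   // hypothetical; not in the shader above
    uint g_vertex_count;
};

[numthreads(64,1,1)]
void BoneDeformCS(uint3 id : SV_DispatchThreadID) {
    if (id.x >= g_vertex_count)
        return;                          // skip the padding threads in the last group
    // ... deformation code as above ...
}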
bone_deform_vs.hlsl
#include "deform_structs.hlsl"
void BoneDeformVS(uint id : SV_VertexID, Vertex vsin, out Vertex vsout) {
uint offset = g_bone_offsets[id];
uint count = g_bone_counts[id];
matrix bone_matrix = 0;
for (uint i = offset; i < (offset + count); ++i) {
BoneWeights bone_info = g_bone_weights[i];
bone_matrix += bone_info.weight * g_bone_array[bone_info.index];
}
vsout.position = mul(float4(vsin.position,1), bone_matrix).xyz;
vsout.normal = normalize(mul(vsin.normal, (float3x3)bone_matrix));
vsout.tangent = normalize(mul(vsin.tangent, (float3x3)bone_matrix));
vsout.texcoord = vsin.texcoord;
vsout.color = vsin.color;
}
Comparing the contents of the buffers once they've run, they are identical and contain the expected values.
I suspect that maybe I'm executing the compute shader incorrectly, spawning too many threads? Do I have the number I pass to Dispatch wrong? Since it is a 1-dimensional row of data, it made sense to me to use [numthreads(64,1,1)]. I've tried various values from 32 to 1024; 64 seems to be the sweet spot, as it's the minimum needed for efficient use of AMD GPUs. Anyway, when I call Dispatch, I ask it to execute vertex_count / 64 groups, rounded up by one extra group when vertex_count isn't a multiple of 64. For 100,000 vertices, the call ends up being Dispatch(1563,1,1).
ID3D11ShaderResourceView * srvs[] = {bone_array_srv, bone_offset_srv,
                                     bone_count_srv, bone_weights_srv,
                                     cs_vertices_srv};
ID3D11UnorderedAccessView * uavs[] = {cs_output_uav};
UINT srv_count = sizeof(srvs) / sizeof(srvs[0]);
UINT uav_count = sizeof(uavs) / sizeof(uavs[0]);
// Round up so a partial final group covers the remaining vertices. The
// parentheses matter: ?: binds looser than +, so without them the whole
// expression collapses to 1.
UINT thread_group_count = vertex_count / 64 + ((vertex_count % 64 != 0) ? 1 : 0);
context->CSSetShader(cs, nullptr, 0);
context->CSSetShaderResources(0, srv_count, srvs);
context->CSSetUnorderedAccessViews(0, uav_count, uavs, nullptr);  // no append counters
context->Dispatch(thread_group_count, 1, 1);
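After the dispatch, the output UAV also has to be unbound before the deformed buffer is read elsewhere, since D3D11 won't leave a resource simultaneously bound for writing and reading:

// Unbind the output UAV so the buffer can later be bound as an SRV or vertex buffer.
ID3D11UnorderedAccessView * null_uav[] = { nullptr };
context->CSSetUnorderedAccessViews(0, 1, null_uav, nullptr);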
And this is how the vertex shader is executed:
ID3D11ShaderResourceView * srvs[] = {bone_array_srv, bone_offset_srv,
bone_count_srv, bone_weights_srv};
UINT srv_count = sizeof(srvs) / sizeof(srvs[0]);
UINT stride = sizeof(Vertex);  // byte size of one vertex (assumes a matching C++ Vertex struct); a stride of 0 would fetch the same vertex for every point
UINT offset = 0;
context->GSSetShader(streamout_gs, nullptr, 0);
context->VSSetShader(vs, nullptr, 0);
context->VSSetShaderResources(0, srv_count, srvs);
context->SOSetTargets(1, &vs_output_buf, &offset);
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_POINTLIST);
context->IASetInputLayout(vs_input_layout);
context->IASetVertexBuffers(0, 1, &vs_vertices, &stride, &offset);
context->Draw(vertex_count, 0);
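For completeness, streamout_gs is a stream-output geometry shader created from the vertex shader bytecode, with a declaration matching the Vertex layout. Roughly like this (vs_bytecode and vs_bytecode_size are placeholder names for the compiled BoneDeformVS blob):

// One D3D11_SO_DECLARATION_ENTRY per output field of Vertex:
// {stream, semantic, semantic index, start component, component count, output slot}
D3D11_SO_DECLARATION_ENTRY so_decl[] = {
    {0, "POSITION", 0, 0, 3, 0},
    {0, "NORMAL",   0, 0, 3, 0},
    {0, "TEXCOORD", 0, 0, 2, 0},
    {0, "TANGENT",  0, 0, 3, 0},
    {0, "COLOR",    0, 0, 4, 0},
};
UINT so_stride = (3 + 3 + 2 + 3 + 4) * sizeof(float);  // one output buffer
device->CreateGeometryShaderWithStreamOutput(
    vs_bytecode, vs_bytecode_size,
    so_decl, ARRAYSIZE(so_decl),
    &so_stride, 1,
    D3D11_SO_NO_RASTERIZED_STREAM,   // nothing is rasterized
    nullptr, &streamout_gs);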
Or is the answer simply that reading from a shader resource view and writing to an unordered access view is far slower than reading from a vertex buffer and writing to a stream-output buffer?
I'm just learning how to work with compute shaders, so I'm not an expert, but regarding your bone calculation I'm sure the CS should run at least as fast as the VS. Intuition tells me that numthreads(64,1,1) is less efficient than something like numthreads(16,16,1).
So you could give this approach a try:

size = ceil(sqrt(numvertices))
Dispatch(size / 16, size / 16, 1)

in your program, and numthreads(16,16,1) in your hlsl file. Pass the size and numvertices values to the shader (e.g. in a constant buffer). Instead of using id.x as the index, you calculate your own (linear) index as int index = id.y * size + id.x (maybe id.xy is also possible as an index). In most cases size * size will be greater than numvertices, so you'll end up with more threads than vertices. You can block these extra threads by adding a condition in your hlsl function:

int index = id.y * size + id.x;
if (index < numvertices) { .. // your code follows
I hope that this approach speeds up your CS calculations.
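Putting this together, a minimal hlsl sketch of the 2D layout (size and numvertices passed in a constant buffer; the names are just examples):

cbuffer Params : register(b0) {
    uint size;          // ceil(sqrt(numvertices)), set by the host
    uint numvertices;
};

[numthreads(16,16,1)]
void BoneDeformCS(uint3 id : SV_DispatchThreadID) {
    uint index = id.y * size + id.x;    // rebuild a linear vertex index
    if (index >= numvertices)
        return;                          // block the surplus threads
    // ... same deformation code as before, indexing with index instead of id.x ...
}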
================ EDIT ==================
My suggestion was based on my own timing tests. In order to verify my case I repeated these tests with more variations of the numthreads parameters. I calculate the Mandelbrot set over 1034 x 827 = 855,118 pixels. Here are the results:
numthreads        Dispatch          groups   threads/   total
  x    y   fps      x     y                  group      threads
  4    4   240     259   207        53445      16       855118
  8    8   550     129   103        13361      64       855118
 16   16   600      65    52         3340     256       855118
 32   32   580      32    26          835    1024       855118
 64    1   550      16   827        13361      64       855118
256    1   460       4   827         3340     256       855118
512    1   370       2   827         1670     512       855118
As you can see, the sweet spot, numthreads(16,16,1), creates the same number of thread groups (3340) as numthreads(256,1,1), but the performance is 30% better (600 vs. 460 fps). Please note that the total thread count is (and must be) always the same! My GPU is an ATI 7790.
================ EDIT 2 ==================
In order to investigate your question about CS vs. VS speed more deeply, I have reviewed a very interesting Channel 9 video (a PDC09 presentation by Microsoft chief architect Chas Boyd about DirectCompute; see the link below). In this presentation Boyd states that optimizing the thread layout (numthreads) can lead to a twofold increase in throughput.
More interesting, however, is the part of his presentation (starting at minute 40) where he explains the correlation between UAVs and GPU memory layout ("Graphics vs. Compute I/O"). I don't want to draw wrong conclusions from Boyd's statements, but it seems at least possible that compute shaders bound via UAVs have lower memory bandwidth than other GPU shaders. If this were true, it might also explain why UAVs can't be bound to the VS stage, for example (at least in Direct3D 11.0).
Since these memory access patterns also depend on hardware design, you should escalate your question directly to ATI / NVIDIA engineers.
CONCLUSION
I have absorbed tons of information about CS usage, but there was not the slightest indication that a CS could run the same algorithm slower than a VS. If that is really the case, you have detected something that matters for everyone who uses DirectCompute.