HLSL 5.0 float1x3 vs float3x1 constant buffer packing rule

I'm trying to get my head around the constant buffer packing rules in HLSL 5.0 and D3D11, so I experimented a little with fxc.exe:

// Generated by Microsoft (R) HLSL Shader Compiler 6.3.9600.18773
//
//
// Buffer Definitions:
//
// cbuffer testbuffer
// {
//
//   float foo;                         // Offset:    0 Size:     4
//   float3x1 bar;                      // Offset:    4 Size:    12 [unused]
//
// }
//
//
// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// testbuffer                        cbuffer      NA          NA    0        1

So far everything behaves as I expect. The float3x1 is 12 bytes in size and can therefore share the first 16-byte slot with the preceding variable, which is only 4 bytes in size. After changing the float3x1 to a float1x3, the compiler output looks like this:

// Generated by Microsoft (R) HLSL Shader Compiler 6.3.9600.18773
//
//
// Buffer Definitions:
//
// cbuffer testbuffer
// {
//
//   float foo;                         // Offset:    0 Size:     4
//   float1x3 bar;                      // Offset:   16 Size:    36 [unused]
//
// }
//
//
// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// testbuffer                        cbuffer      NA          NA    0        1

So it seems that the HLSL compiler suddenly gives every float in the float1x3 its own 16-byte slot, which is quite wasteful. I googled a lot to understand this behavior but couldn't find anything. I hope someone can explain it to me, since it really confuses me.

Solution

This answer is conjecture based on my understanding of HLSL, which uses column-major matrix packing by default. Each constant register in HLSL consists of four 4-byte components, for a total of 16 bytes per register. With column-major packing, each column of a matrix is stored in its own register.

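When you declare a float3x1, you are declaring a matrix with three rows and one column (the HLSL naming is floatRxC, rows first). That single column holds three floats, 12 bytes in total, so it fits into one 16-byte register and can even share that register with the preceding float.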

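When you declare a float1x3, you are declaring a matrix with one row and three columns. Because each column gets its own register, the data is spread across three registers, each holding a single float. The matrix therefore starts on a fresh register at offset 16, and the reported size of 36 bytes is two full registers (32 bytes) plus the 4 bytes used in the third.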
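The arithmetic behind both layouts can be modelled in a few lines. The following Python sketch is only a simplified model of the packing behavior described here, not the compiler's actual algorithm; the function names (`matrix_size`, `place`, `layout_after_float`) are made up for illustration, and it only handles float matrices declared directly after a single float.

```python
REGISTER_BYTES = 16  # one constant register: four 4-byte components
FLOAT_BYTES = 4


def matrix_size(rows, cols, row_major=False):
    """Bytes reported for a float matrix: one register per column (or per
    row when row_major), counting only the used part of the last register."""
    registers, per_register = (rows, cols) if row_major else (cols, rows)
    return (registers - 1) * REGISTER_BYTES + per_register * FLOAT_BYTES


def place(offset, size):
    """Start offset of a variable of `size` bytes when the previous
    variables end at `offset` (assumed rule: never straddle a register)."""
    used = offset % REGISTER_BYTES
    if used and used + size > REGISTER_BYTES:
        offset += REGISTER_BYTES - used  # skip to the next register
    return offset


def layout_after_float(rows, cols, row_major=False):
    """(offset, size) of a float{rows}x{cols} declared right after one float."""
    size = matrix_size(rows, cols, row_major)
    return place(FLOAT_BYTES, size), size


print(layout_after_float(3, 1))  # float3x1 -> (4, 12)
print(layout_after_float(1, 3))  # float1x3 -> (16, 36)
```

Both results match the offsets and sizes in the fxc output from the question. Under the same model, a float1x3 declared with the row_major modifier would come out as 12 bytes at offset 4, but that is a prediction of this sketch and has not been checked against the compiler.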

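If you need a 1xN matrix, you are better off declaring a vector (for example float3) instead. A vector fits within a single register and can be used in most situations where either a 1x3 or a 3x1 matrix could be, since mul() treats a vector as a row vector or a column vector depending on which side of the matrix it appears on.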
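For the vector variant, the matching CPU-side layout is easy to check. The sketch below assumes the cbuffer is declared as `float foo; float3 bar;` and that `bar` then lands at offset 4, as the packing rule above suggests. The `TestBufferCPU` name is hypothetical, and Python's ctypes stands in for whatever host-side struct you actually upload.

```python
import ctypes


class TestBufferCPU(ctypes.Structure):
    """Host-side mirror of the assumed cbuffer { float foo; float3 bar; }."""
    _fields_ = [
        ("foo", ctypes.c_float),      # expected at offset 0
        ("bar", ctypes.c_float * 3),  # expected at offset 4, 12 bytes
    ]


print(TestBufferCPU.bar.offset)      # byte offset of bar in the struct
print(ctypes.sizeof(TestBufferCPU))  # total size in bytes
```

If the offset printed for `bar` is 4 and the total size is 16, the struct fills exactly one register and can be copied into the constant buffer as-is. It is still worth confirming the real offsets in the fxc output for your own shader.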