API-agnostic row/column-major matrix representation

In D3D/HLSL we use row vectors (1xN matrices) and therefore pre-multiplication (vector * matrix), so the translation part is stored in the 4th row of the matrix:

m00 m01 m02 0
m10 m11 m12 0
m20 m21 m22 0
Tx  Ty  Tz  1

so the transformed x coordinate is x' = x*m00 + y*m10 + z*m20 + 1*Tx. If the matrix contains only a translation, this reduces to x' = x*1 + 1*Tx = x + Tx.
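To make the formula above concrete, here is a minimal C++ sketch of the row-vector convention. The names Vec4, Mat4, mul and translation are hypothetical and chosen for illustration only; the matrix is indexed as m[row][col], exactly as it is drawn above, without regard to any particular memory layout.

```cpp
#include <array>
#include <cstddef>

// Hypothetical types for illustration; not taken from any particular library.
using Vec4 = std::array<float, 4>;

struct Mat4 {
    // m[row][col], written the same way the matrix is drawn above.
    float m[4][4];
};

// Row-vector convention: v' = v * M,
// so v'[col] = sum over row of v[row] * M[row][col].
inline Vec4 mul(const Vec4& v, const Mat4& M) {
    Vec4 out{};
    for (std::size_t col = 0; col < 4; ++col) {
        for (std::size_t row = 0; row < 4; ++row) {
            out[col] += v[row] * M.m[row][col];
        }
    }
    return out;
}

// Translation-only matrix: identity with (tx, ty, tz) in the 4th row.
inline Mat4 translation(float tx, float ty, float tz) {
    return Mat4{{{1.0f, 0.0f, 0.0f, 0.0f},
                 {0.0f, 1.0f, 0.0f, 0.0f},
                 {0.0f, 0.0f, 1.0f, 0.0f},
                 {tx,   ty,   tz,   1.0f}}};
}
```

With this sketch, mul({1, 2, 3, 1}, translation(10, 20, 30)) yields {11, 22, 33, 1}, which matches x' = x + Tx.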

Based on the HLSL docs and some past experiments, uniform matrices are loaded by columns, so one register will contain {m00,m10,m20,m30}. This suits the vector-matrix multiplication, because it becomes 4 dot products (x' = vector register dot matrix register 0), each of which is probably a single hardware instruction.

On the other hand, OGL/GLSL uses column vectors and post-multiplication, which means that the translation part is stored in the 4th column instead (the transpose of the matrix above). According to the wiki, "GLSL matrices are always column-major".

Some key points:

On the CPU side, I'd like to vectorize the matrix operations.
This means that the memory layout is important, so that the appropriate columns can be loaded into registers quickly.
Currently my matrix is stored by columns {m00,m10,m20,m30,m01,...,m33}, where m30,m31,m32 store the translation part (I'm using row vectors -> pre-multiplication).
Additionally, I'd like to use the same memory layout for passing uniform data to the graphics API (memcpy into a buffer without transposing the matrix).
I examined the Matrix implementation in UE4 and I can't see any separation based on the underlying rendering API (which is the expected result).
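The storage described in these points can be sketched roughly as follows. This is a scalar illustration only, with hypothetical names (Mat4, at, column, dot4, upload); a real implementation would replace dot4 with SIMD code, and upload would write to a mapped uniform buffer rather than a plain pointer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

using Vec4 = std::array<float, 4>;

// Column-major storage: {m00,m10,m20,m30, m01,...,m33}.
struct Mat4 {
    std::array<float, 16> data{};

    // Element at (row, col); each column is 4 contiguous floats.
    float& at(std::size_t row, std::size_t col) {
        assert(row < 4 && col < 4);
        return data[col * 4 + row];
    }

    // Pointer to the 4 contiguous floats of one column.
    const float* column(std::size_t col) const {
        assert(col < 4);
        return data.data() + col * 4;
    }
};

// Scalar stand-in for what would be a SIMD dot product.
inline float dot4(const float* a, const float* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Row-vector convention: v' = v * M is four dot products,
// one against each contiguous column.
inline Vec4 mul(const Vec4& v, const Mat4& M) {
    return {dot4(v.data(), M.column(0)), dot4(v.data(), M.column(1)),
            dot4(v.data(), M.column(2)), dot4(v.data(), M.column(3))};
}

// Identity with the translation in the 4th row (m30, m31, m32).
inline Mat4 translation(float tx, float ty, float tz) {
    Mat4 M;
    M.at(0, 0) = 1.0f;
    M.at(1, 1) = 1.0f;
    M.at(2, 2) = 1.0f;
    M.at(3, 3) = 1.0f;
    M.at(3, 0) = tx;
    M.at(3, 1) = ty;
    M.at(3, 2) = tz;
    return M;
}

// Copies the 16 floats as-is, without transposing.
// dst is assumed to point at 64 writable bytes.
inline void upload(const Mat4& M, void* dst) {
    std::memcpy(dst, M.data.data(), sizeof(float) * M.data.size());
}
```

In this layout the translation ends up at flat indices 3, 7 and 11, and each column that takes part in a dot product is one contiguous 16-byte load.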

What is the best way to handle these differences in an API-agnostic way?

I imagine I could keep my current matrix implementation (DX-style, row vectors, pre-multiplication, column-major storage for SSE) and set the row_major layout in the GLSL code, so that it reads the data the other way around. If that works, what is the performance impact, if any?
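Why this should work can be checked on the CPU. The sketch below (hypothetical function names, plain C++ standing in for shader code) reads the same 16 floats in two ways: as a column-major matrix multiplied by a row vector, and as a row-major matrix multiplied by a column vector. Reading column-major data as row-major gives the transpose, and M^T * v is the same as v * M, so both produce the same result. It only demonstrates the arithmetic and says nothing about performance on a GPU.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;
using Floats16 = std::array<float, 16>;

// Row-vector convention, memory read as column-major:
// element (row, col) lives at mem[col * 4 + row]. Computes v * M.
inline Vec4 mul_row_vector_column_major(const Vec4& v, const Floats16& mem) {
    Vec4 out{};
    for (std::size_t col = 0; col < 4; ++col) {
        for (std::size_t row = 0; row < 4; ++row) {
            out[col] += v[row] * mem[col * 4 + row];
        }
    }
    return out;
}

// Column-vector convention, the SAME memory read as row-major:
// element (row, col) lives at mem[row * 4 + col]. Computes M' * v,
// where M' is the transpose of the matrix seen by the function above.
inline Vec4 mul_row_major_column_vector(const Floats16& mem, const Vec4& v) {
    Vec4 out{};
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t col = 0; col < 4; ++col) {
            out[row] += mem[row * 4 + col] * v[col];
        }
    }
    return out;
}
```

For example, with an identity matrix whose flat indices 3, 7 and 11 hold the translation (10, 20, 30), both functions map (1, 2, 3, 1) to (11, 22, 33, 1).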

The main target platforms are Vulkan and D3D11/12.

Solution

Both APIs target the same GPUs, so you can be reasonably sure matrix layout does not matter for performance there. For that matter, Vulkan can consume HLSL (compiled to SPIR-V) if you want.