c math matrix compiler-optimization calling-convention

Which variant of a 4x4 matrix multiplication function should I use?

I am studying 3D rendering using OpenGL and C, and writing a small math library as a learning exercise. Is it better to return the result of the matrix multiplication function by value, using a return statement, or to write it to an output matrix through a pointer?

typedef float vec_t;

typedef struct mat4_s {
    vec_t m[4][4];
} mat4_t;

void Mat4Mult(mat4_t* out, const mat4_t* in1, const mat4_t* in2) {
    out->m[0][0] = /* ... */;
    out->m[1][0] = /* ... */;
    /* ... */
}

mat4_t Mat4Mult(const mat4_t* in1, const mat4_t* in2) {
    mat4_t result;
    result.m[0][0] = /* ... */;
    result.m[1][0] = /* ... */;
    /* ... */
    return result;
}

I want to understand which option is preferable. I think both are correct, but I prefer returning the result with a return statement. Please correct me if I'm wrong; I haven't fully mastered C.

Solution

It is very difficult to answer these questions by intuition, even if you have a mountain of experience. This is why you should try both, and profile the results. Let's compare the following naive 4x4 matrix multiplication functions:

void Mat4Mult_Dest(mat4_t* out, const mat4_t* in1, const mat4_t* in2) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out->m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out->m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
}

mat4_t Mat4Mult_Ret(const mat4_t* in1, const mat4_t* in2) {
    mat4_t out = {0};
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out.m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out.m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
    return out;
}
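If you want to time the two variants yourself, the following is a minimal sketch of a harness built on the standard clock() function. It is not the setup that produced the results in this answer, and the helper names BenchDest and BenchRet are hypothetical. Each iteration feeds the previous product back in as an input, so the compiler cannot hoist the multiplication out of the loop.

```c
#include <time.h>

typedef float vec_t;
typedef struct mat4_s { vec_t m[4][4]; } mat4_t;

/* Same naive kernels as above, repeated so this sketch compiles on its own. */
void Mat4Mult_Dest(mat4_t* out, const mat4_t* in1, const mat4_t* in2) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out->m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out->m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
}

mat4_t Mat4Mult_Ret(const mat4_t* in1, const mat4_t* in2) {
    mat4_t out;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out.m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out.m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
    return out;
}

/* Hypothetical helper: computes acc = acc * step, iters times, through the
   destination-pointer variant. Returns the elapsed CPU time in seconds. */
double BenchDest(mat4_t* acc, const mat4_t* step, long iters) {
    clock_t start = clock();
    for (long n = 0; n < iters; ++n) {
        mat4_t tmp;                       /* out must not overlap the inputs */
        Mat4Mult_Dest(&tmp, acc, step);
        *acc = tmp;
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: the same loop through the return-by-value variant. */
double BenchRet(mat4_t* acc, const mat4_t* step, long iters) {
    clock_t start = clock();
    for (long n = 0; n < iters; ++n) {
        *acc = Mat4Mult_Ret(acc, step);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling each helper with a few million iterations on an optimized build and comparing the returned times gives only a rough picture; clock() is coarse, and a dedicated benchmarking tool controls for far more noise.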

Clang 15.0 Results

GCC 12.2 Results

The results vary significantly between GCC and Clang. Looking at the assembly, this is probably because Clang inlined the _Ret version but not the _Dest version, whereas GCC inlined both, making them perform essentially the same. That parity is unsurprising, because the two functions perform the same calculations.

Conclusion

According to the benchmarks, returning by value is at least as fast as writing to a destination matrix. It is more inlining-friendly for some compilers, which may improve performance. However, you could likely achieve the same results by annotating your functions so that they are more likely to be inlined.
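As a rough sketch of what such annotations could look like: defining the functions as static inline (typically in a header) makes the body visible at every call site, and GCC and Clang additionally accept the non-standard always_inline attribute. The macro name MAT4_FORCE_INLINE is made up for this example, and whether either annotation changes the generated code for your build is something to verify by measuring.

```c
typedef float vec_t;
typedef struct mat4_s { vec_t m[4][4]; } mat4_t;

/* Hypothetical macro: always_inline is a GCC/Clang extension, so fall back
   to plain `static inline` on other compilers. */
#if defined(__GNUC__)
#define MAT4_FORCE_INLINE static inline __attribute__((always_inline))
#else
#define MAT4_FORCE_INLINE static inline
#endif

/* Standard C99: `static inline` is only a hint, but it guarantees that the
   definition is visible in every translation unit that includes it. */
static inline void Mat4Mult_Dest(mat4_t* out, const mat4_t* in1, const mat4_t* in2) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out->m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out->m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
}

/* Compiler-specific: asks GCC/Clang to inline regardless of their heuristics. */
MAT4_FORCE_INLINE mat4_t Mat4Mult_Ret(const mat4_t* in1, const mat4_t* in2) {
    mat4_t out;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out.m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out.m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
    return out;
}
```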

It is worth noting that in Mat4Mult_Ret, return out; writes to a destination in place anyway, because large objects are returned through a hidden destination pointer (passed in rdi) in the x86-64 System V ABI:

Mat4Mult_Ret:
// ...
// last 4 instructions move result to destination pointer
movups  xmmword ptr [rdi + 48], xmm1
movups  xmmword ptr [rdi + 32], xmm7
movups  xmmword ptr [rdi + 16], xmm4
movups  xmmword ptr [rdi], xmm3
ret

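There is one notable difference between your functions, though: mat4_t* out may alias in1 or in2, but a local mat4_t out cannot. Because of that possible aliasing, the compiler must assume that every store through out could change the inputs. Consider marking your pointers restrict to give the compiler more optimization freedom; callers must then never pass the same matrix as both an input and the output.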
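A minimal sketch of the restrict-qualified signature follows (C99 or later). The wrapper Mat4MultInPlace is a hypothetical helper added for illustration; it shows one way to keep supporting a = a * b once the destination is no longer allowed to overlap the inputs.

```c
typedef float vec_t;
typedef struct mat4_s { vec_t m[4][4]; } mat4_t;

/* restrict promises that *out is only accessed through out during the call,
   so stores to it cannot change *in1 or *in2. Passing the same matrix as out
   and as an input is undefined behavior. */
void Mat4Mult_Dest(mat4_t* restrict out,
                   const mat4_t* restrict in1,
                   const mat4_t* restrict in2) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out->m[i][j] = 0;
            for (int k = 0; k < 4; ++k) {
                out->m[i][j] += in1->m[i][k] * in2->m[k][j];
            }
        }
    }
}

/* Hypothetical helper: computes inout = inout * rhs through a temporary, so
   the restrict contract of Mat4Mult_Dest is never violated. */
void Mat4MultInPlace(mat4_t* inout, const mat4_t* rhs) {
    mat4_t tmp;
    Mat4Mult_Dest(&tmp, inout, rhs);
    *inout = tmp;
}
```

The temporary costs one extra 64-byte copy, which is the same copy the aliasing-safe code would otherwise need somewhere anyway.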