How to avoid `out` parameter error when using intrinsics?

I am trying out the new hardware intrinsics added to .NET Core 3.0, specifically in order to accelerate operations on matrices. For matrix addition, I have a function which takes two 4x4 float matrices as `in` parameters and a third `out` matrix to store the results in. It uses the SSE 128-bit vector intrinsics to add the inputs and store the results in the output:

public unsafe static void Add(in Matrix l, in Matrix r, out Matrix o)
{
    fixed (float* lp = &l.m00, rp = &r.m00, op = &o.m00)
    {
        var c1 = Sse.Add(Sse.LoadVector128(lp + 0),  Sse.LoadVector128(rp + 0));
        var c2 = Sse.Add(Sse.LoadVector128(lp + 4),  Sse.LoadVector128(rp + 4));
        var c3 = Sse.Add(Sse.LoadVector128(lp + 8),  Sse.LoadVector128(rp + 8));
        var c4 = Sse.Add(Sse.LoadVector128(lp + 12), Sse.LoadVector128(rp + 12));
        Sse.Store(op + 0,  c1);
        Sse.Store(op + 4,  c2);
        Sse.Store(op + 8,  c3);
        Sse.Store(op + 12, c4);
    }
}

Now obviously the C# compiler has an issue with this, because it can't tell that the output matrix is ever written to, so it reports an error that the function cannot return until `o` has been assigned. My question is whether there is any way around this without having to assign to the variable before performing the intrinsics operations, such as putting `o = default;` on the first line of the function.

I originally considered something along the lines of:

var op = stackalloc float[16];
fixed (float* lp = &l.m00, rp = &r.m00)
{
...
}
o = *(Matrix*)op;

but realized this doesn't avoid copying the struct, which defeats the whole point of passing the matrix as an `out` parameter.

I realize that this would work if I passed the output Matrix as `ref` instead, or if I just returned a matrix instance from the function, but it would be nice to keep the helpful inline syntax (`Matrix.Add(l, r, out Matrix o)`) and the performance benefits of passing large value types around by reference.

Solution

I'm assuming here that you are using a Matrix type that is a struct. Obviously, if it were a reference type, then your method would in fact have to initialize the parameter value before you could use it, so the fact that your code doesn't indicates to me that it's a value type.

The C# compiler cannot be made to ignore compile-time errors. And it's a compile-time error to not initialize an out parameter before the method returns. So you are stuck.

That said, I don't think this should be a significant hardship. You can write your method like so:

public unsafe static void Add(in Matrix l, in Matrix r, out Matrix o)
{
    o = default(Matrix); // definitely assigns o up front; the stores below overwrite the zeros

    fixed (float* lp = &l.m00, rp = &r.m00, op = &o.m00)
    {
        var c1 = Sse.Add(Sse.LoadVector128(lp + 0),  Sse.LoadVector128(rp + 0));
        var c2 = Sse.Add(Sse.LoadVector128(lp + 4),  Sse.LoadVector128(rp + 4));
        var c3 = Sse.Add(Sse.LoadVector128(lp + 8),  Sse.LoadVector128(rp + 8));
        var c4 = Sse.Add(Sse.LoadVector128(lp + 12), Sse.LoadVector128(rp + 12));
        Sse.Store(op + 0,  c1);
        Sse.Store(op + 4,  c2);
        Sse.Store(op + 8,  c3);
        Sse.Store(op + 12, c4);
    }
}

This will compile to something like this (I picked an arbitrary Matrix type for the sake of the example; it's obviously not the one you're using, but the basic premise is the same):

IL_0000:  ldarg.0
IL_0001:  initobj    System.Windows.Media.Matrix

Which in turn will simply initialize the block of memory to 0 values. Per the documentation for the initobj opcode:

The initobj instruction initializes each field of the value type specified by the pushed address (of type native int, &, or *) to a null reference or a 0 of the appropriate primitive type. After this method is called, the instance is ready for a constructor method to be called. If typeTok is a reference type, this instruction has the same effect as ldnull followed by stind.ref.

Unlike Newobj, initobj does not call the constructor method. Initobj is intended for initializing value types, while newobj is used to allocate and initialize objects.

In other words, initobj, which is what you get when you use default(Matrix), is a very simple initialization, merely zeroing out the memory location. It should be fast enough, and in any case is obviously less overhead than allocating a whole new copy of the object and then copying the result back to the original variable, whether that's done locally or via return value.
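If even that zeroing is unwanted, later runtimes than the one in the question offer a way to satisfy the assignment rule without emitting initobj. As far as I know, `Unsafe.SkipInit<T>(out T)` in System.Runtime.CompilerServices is available from .NET 5 onward, so treat its availability on your target as an assumption to check. It does not bypass the compiler rule; the compiler simply sees a call that assigns the `out` argument. The sketch below uses a hypothetical 16-float Matrix and a hypothetical MatrixSkipInit class, and needs unsafe code enabled for the project:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;

// Hypothetical stand-in for the question's Matrix: 16 contiguous floats.
[StructLayout(LayoutKind.Sequential)]
public struct Matrix
{
    public float m00, m01, m02, m03;
    public float m10, m11, m12, m13;
    public float m20, m21, m22, m23;
    public float m30, m31, m32, m33;
}

public static class MatrixSkipInit
{
    public static unsafe void Add(in Matrix l, in Matrix r, out Matrix o)
    {
        // Counts as assigning `o` for definite assignment, but writes nothing.
        // Every field must therefore be stored below, or it holds garbage.
        Unsafe.SkipInit(out o);

        fixed (float* lp = &l.m00, rp = &r.m00, op = &o.m00)
        {
            for (int i = 0; i < 16; i += 4)
            {
                if (Sse.IsSupported)
                {
                    // One 128-bit add per row of four floats.
                    Sse.Store(op + i, Sse.Add(Sse.LoadVector128(lp + i), Sse.LoadVector128(rp + i)));
                }
                else
                {
                    // Scalar fallback for hardware without SSE.
                    for (int j = i; j < i + 4; j++) op[j] = lp[j] + rp[j];
                }
            }
        }
    }
}
```

The call site keeps the inline form, `MatrixSkipInit.Add(in a, in b, out Matrix sum);`. The trade-off is that nothing protects you if a later change forgets to write one of the rows.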

All that said, this depends a lot on the context of how you're going to call the method. While you say that you would like to preserve the convenience of inline declaration, it's not clear to me why you would want that for a method that is apparently performance-critical enough to be using SSE features and unsafe code. With inline declaration, you necessarily are going to have to reinitialize the variable with each call.

If this method is actually being called in a performance-critical way, then to me that implies it's in a loop being called a large number of times, possibly millions or more. In that situation, you may prefer the `ref` option, where you can initialize the variable outside your loop and then just reuse that variable for each call, rather than redeclaring a new variable for each call.
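A minimal sketch of that `ref` variant, assuming a hypothetical 16-float Matrix laid out like the one in the question (the MatrixRef and Demo names are made up for illustration, and the project needs unsafe code enabled):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;

// Hypothetical stand-in for the question's Matrix: 16 contiguous floats.
[StructLayout(LayoutKind.Sequential)]
public struct Matrix
{
    public float m00, m01, m02, m03;
    public float m10, m11, m12, m13;
    public float m20, m21, m22, m23;
    public float m30, m31, m32, m33;
}

public static class MatrixRef
{
    // With `ref` the caller owns initialization, so no `o = default` is needed here.
    public static unsafe void Add(in Matrix l, in Matrix r, ref Matrix o)
    {
        fixed (float* lp = &l.m00, rp = &r.m00, op = &o.m00)
        {
            for (int i = 0; i < 16; i += 4)
            {
                if (Sse.IsSupported)
                {
                    // One 128-bit add per row of four floats.
                    Sse.Store(op + i, Sse.Add(Sse.LoadVector128(lp + i), Sse.LoadVector128(rp + i)));
                }
                else
                {
                    // Scalar fallback for hardware without SSE.
                    for (int j = i; j < i + 4; j++) op[j] = lp[j] + rp[j];
                }
            }
        }
    }
}

public static class Demo
{
    public static Matrix AddRepeatedly(in Matrix a, in Matrix b, int iterations)
    {
        Matrix result = default;                   // initialized once, outside the loop
        for (int i = 0; i < iterations; i++)
            MatrixRef.Add(in a, in b, ref result); // same storage reused on every call
        return result;
    }
}
```

The cost is at the call site: the variable has to be declared before the call, so the inline `out Matrix o` form is no longer available.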