How would I write a multiplication of double values in NEON assembly?

The line in question is pretty contained:

w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1]

Considering these variables are double (but I can downgrade to float), I think I can pass one value per register? Would it be more efficient to use the 2x2 matrix W directly?

EDIT 1:

This line is inside a loop that is fired hundreds of times per second and has real-time requirements. Instruments says this line takes 60% of the time of the loop.

EDIT 2:

These are the loops I'm talking about:

for (int x=startingX; x<endingX; ++x)
{
    for (int y=startingY; y<endingY; ++y)
    {
        Matx21d position(x,y);

        // warp patch
        uint8_t *data;
        [self backwardWarpPatchWithWarpingMatrix:warpingMatrix withWarpData:&data withReferenceImage:_initialView withCenter:position];

        // check that the backward patch was successful
        if (!data)
            continue;

        // calculate zero mean (on the patch) sum of squared differences
        int ssd = [self computeZMSSDScoreWithX:x withY:y withCurrentTargetPatch:data];

        if (fabs(ssd) < bestSSD)
        {
            bestPosition = position;
            bestSSD = ssd;
        }
    }
}

backwardWarpPatchWithWarpingMatrix:

Matx22d warpingMatrixInverse = warpingMatrix.inv();
double wmi0 = warpingMatrixInverse(0,0), wmi1 = warpingMatrixInverse(0,1), wmi2 = warpingMatrixInverse(1,0), wmi3 = warpingMatrixInverse(1,1);

if (isnan(wmi0))
{
    warpingMatrixInverse = Matx22d::eye();
}

// Perform the warp on a larger patch.
int LEVEL_REF = 0, halfPatchSize = PATCH_SIZE/2;
Matx21d centerInLevel = center * (1.0 / (1<<LEVEL_REF));
__block Mat warped(PATCH_SIZE, PATCH_SIZE, CV_8UC1);


dispatch_apply(PATCH_SIZE, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t y)
{
    for (int x=0; x<PATCH_SIZE; ++x)
    {
        double pp0 = x - halfPatchSize, pp1 = (double)y - halfPatchSize;
        Matx21d multiplication(wmi0 * pp0 + wmi1 * pp1, wmi2 * pp0 + wmi3 * pp1);
        Matx21d px(multiplication(0) + centerInLevel(0), multiplication(1) + centerInLevel(1));
        double warpedPixel = [self interpolatePointInImage:referenceImage withU:px(0) withV:px(1)];
        warped.at<uchar>(y,x) = (uint8_t)warpedPixel;
    }
});

computeReferencePatchScores:

int x = (int)u;
int y = (int)v;
float   subpixX = u - x,
        subpixY = v - y,
        oneMinusSubpixX = 1.0 - subpixX,
        oneMinusSubpixY = 1.0 - subpixY;


float   w00 = oneMinusSubpixX * oneMinusSubpixY,
        w01 = oneMinusSubpixX * subpixY,
        w10 = subpixX * oneMinusSubpixY,
        w11 = 1.0f - w00 - w01 - w10;

const int stride = (int)image.step.p[0];
uchar* ptr = image.data + y * stride + x;

return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1];

Solution

You typically don't translate a single line of code into assembly. For it to be worth writing in assembly, you first have to assume that you can generate better assembly than the compiler will. Sometimes that's true for vectorized code on NEON, but it's usually because you have special knowledge about a complex loop. You're unlikely to beat the compiler significantly on a single line of code (and will likely lose). Is this line part of a loop that you've profiled and identified as a major bottleneck? Have you already tried Accelerate? Have you analyzed the assembly the compiler is generating and found mistakes that it's making?

Trying to do this in ObjC++ is very inefficient. ObjC++ is a glue language for tying together C++ and ObjC; doing both in the same file imposes several performance costs, especially with ARC. Calling an ObjC method inside a performance-critical inner loop is very expensive in any case (even if there weren't mixed-in C++), because dynamic dispatch cannot be inlined. You should avoid any function call the compiler can't inline (least of all an ObjC method dispatch) inside a tight inner loop. It's not clear where you're actually calling computeReferencePatchScores. The use of GCD here is probably hurting you more than helping (since it prevents the compiler from applying certain vector optimizations).

This is all to say: how a particular line of code is being compiled into assembly is by far the least of your problems in this code. Its structure is fighting clang's optimizer.

Step one is to step back and ask what computation you want to execute, and then read through the Core Image Programming Guide and the vImage Programming Guide and verify that it isn't already available. You might also look over OpenGL ES, but OpenGL is often a whole approach to drawing (so it's a bit more of a commitment). It looks like you're already using OpenCV, so make sure it doesn't have available functions to do what you want. (Most of what I see in there looks like stuff built into both OpenCV and vImage.)

The simplest way to improve performance without moving to more powerful frameworks is to move the entire loop into a single C++ function. Then the optimizer can see all the code and apply vector operations on its own. But the next step is to make use of the high-level high-performance frameworks already available.
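As a rough sketch of what "a single C++ function" could look like, the warp and the bilinear interpolation from the question can be folded into one plain loop with no method dispatch and no GCD. The name `warpPatch` and its parameter list are hypothetical, the inverse matrix is passed as four floats `a, b, c, d` (the question's `wmi0`..`wmi3`), and the switch from double to float is an assumption based on the question saying float is acceptable.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original code): warps a
// patchSize x patchSize patch centered at (cx, cy) through the inverse
// matrix [a b; c d] and samples the image with bilinear interpolation.
// Returns false if any sample would read outside the image.
static bool warpPatch(const std::uint8_t* image, int width, int height, int stride,
                      float a, float b, float c, float d,
                      float cx, float cy,
                      int patchSize, std::uint8_t* out)
{
    const int half = patchSize / 2;
    for (int py = 0; py < patchSize; ++py) {
        const float dy = static_cast<float>(py - half);
        for (int px = 0; px < patchSize; ++px) {
            const float dx = static_cast<float>(px - half);

            // Position in the reference image.
            const float u = a * dx + b * dy + cx;
            const float v = c * dx + d * dy + cy;
            const int x = static_cast<int>(u);
            const int y = static_cast<int>(v);

            // The 2x2 neighborhood (x, y)..(x+1, y+1) must be inside the image.
            if (u < 0.0f || v < 0.0f || x + 1 >= width || y + 1 >= height)
                return false;

            // Same weights as in the question; w11 is written as fx * fy,
            // which is algebraically equal to 1 - w00 - w01 - w10.
            const float fx = u - x, fy = v - y;
            const float w00 = (1.0f - fx) * (1.0f - fy);
            const float w01 = (1.0f - fx) * fy;
            const float w10 = fx * (1.0f - fy);
            const float w11 = fx * fy;

            const std::uint8_t* p = image + y * stride + x;
            out[py * patchSize + px] = static_cast<std::uint8_t>(
                w00 * p[0] + w01 * p[stride] + w10 * p[1] + w11 * p[stride + 1]);
        }
    }
    return true;
}
```

With everything visible in one translation unit, the optimizer is at least free to inline and hoist the per-row terms; whether it actually vectorizes the gather-style loads is something to check in the generated assembly rather than assume.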

In any case, you'll want to sit down and carefully work through exactly the calculations you need to perform (I usually do this by hand on paper). Make sure you're not duplicating anything, that you need every calculation you're performing, and that each change you make still generates the same result.
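One way to check that a change "still generates the same result" is to keep a slow reference version around and compare it against the rewritten one over a grid of sample points. The sketch below is only an illustration: `bilinearReference`, `bilinearFloat`, and `maxAbsDifference` are hypothetical names, and the float version uses a two-lerp form that is algebraically equivalent to the four-weight sum in the question.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Reference: the four-weight formula from the question, in double precision.
static double bilinearReference(const std::uint8_t* image, int stride, double u, double v)
{
    const int x = static_cast<int>(u), y = static_cast<int>(v);
    const double fx = u - x, fy = v - y;
    const std::uint8_t* p = image + y * stride + x;
    return (1.0 - fx) * (1.0 - fy) * p[0] + (1.0 - fx) * fy * p[stride]
         + fx * (1.0 - fy) * p[1] + fx * fy * p[stride + 1];
}

// Candidate rewrite: float precision, two horizontal lerps and one vertical.
static float bilinearFloat(const std::uint8_t* image, int stride, float u, float v)
{
    const int x = static_cast<int>(u), y = static_cast<int>(v);
    const float fx = u - x, fy = v - y;
    const std::uint8_t* p = image + y * stride + x;
    const float top = p[0] + fx * (p[1] - p[0]);
    const float bottom = p[stride] + fx * (p[stride + 1] - p[stride]);
    return top + fy * (bottom - top);
}

// Largest absolute disagreement between the two versions, sampled at
// stepsPerPixel positions per pixel over every valid 2x2 neighborhood.
static double maxAbsDifference(const std::uint8_t* image, int width, int height,
                               int stride, int stepsPerPixel)
{
    double worst = 0.0;
    for (int j = 0; j < (height - 1) * stepsPerPixel; ++j) {
        for (int i = 0; i < (width - 1) * stepsPerPixel; ++i) {
            const double u = static_cast<double>(i) / stepsPerPixel;
            const double v = static_cast<double>(j) / stepsPerPixel;
            const double ref = bilinearReference(image, stride, u, v);
            const double fast = bilinearFloat(image, stride,
                                              static_cast<float>(u), static_cast<float>(v));
            worst = std::max(worst, std::fabs(ref - fast));
        }
    }
    return worst;
}
```

Since the result is truncated to a `uint8_t` anyway, a tolerance well below 1.0 is probably adequate, but the threshold you accept is a judgment call for your application.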