Tags: ios, c, arm, vectorization, neon

How could I vectorize this for loop?


I have this loop:

void f1(unsigned char *data, unsigned int size) {
    unsigned int A[256] = {0u};      
    for (register unsigned int i = 0u; i < size; i++) {
        ++A[data[i]];
    }
   ...

Is there any way to vectorize it manually?


Solution

  • Since multiple entries in data might contain the same value, I don't see how this could be vectorized simply: two lanes of the same vector could need to increment the same counter in A, and those updates would conflict. The point of vectorization is that each element is independent of the other elements, and so can be computed in parallel. But your algorithm doesn't allow that. "Vectorize" is not the same thing as "make go faster."

    What you seem to be building here is a histogram, and iOS has built-in, optimized support for that. You can create a single-channel, single-row image and use vImageHistogramCalculation_Planar8 like this:

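    #include <Accelerate/Accelerate.h>

    void f1(unsigned char *data, unsigned int size) {
        // unsigned long matches vImagePixelCount, the histogram element type
        unsigned long A[256] = {0};

        // Describe the data as a one-row Planar8 image:
        // { data, height, width, rowBytes }
        vImage_Buffer src = { data, 1, size, size };
        vImage_Error err = vImageHistogramCalculation_Planar8(&src, A, kvImageDoNotTile);
        if (err != kvImageNoError) {
            // error
        }
        ...
    }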
    

    Be careful about assuming this is always a win, though. It depends on the size of your data. The call into vImage carries a fixed overhead, so it can take several million bytes of data to make it worth it. If you're computing this on smaller sets than that, then a simple, compiler-optimized loop is often the best approach. You need to profile this on real devices to see which is faster for your purposes.

    Just make sure to allow the compiler to apply all vectorizing optimizations by turning on -Ofast (Fastest, Aggressive). That won't matter in this case because your loop can't be simply vectorized. But in general, -Ofast allows the compiler to apply vectorizing optimizations in cases where they might slightly grow code size (which isn't allowed under the default -Os). -Ofast also allows a little sloppiness in how floating-point math is performed, so it should not be used in cases where strict IEEE floating-point conformance is required (but this is almost never the case for iOS apps, so -Ofast is almost always the correct setting).
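
To illustrate the point about repeated values, here is a minimal sketch of a common scalar workaround. It is not from the answer above, and it is not SIMD code: it counts into four separate tables so that neighboring bytes never update the same counter, then adds the tables together. The name histogram_4way and the choice of four tables are assumptions made for illustration, and whether this beats the plain loop is something you would have to measure on your own device.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the original post): builds the same
   byte histogram as the question's loop, using four partial tables. */
void histogram_4way(const unsigned char *data, size_t size,
                    unsigned int hist[256]) {
    /* One private table per position within a group of four bytes, so
       equal neighboring bytes increment different counters. */
    unsigned int lanes[4][256];
    memset(lanes, 0, sizeof lanes);

    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        ++lanes[0][data[i]];
        ++lanes[1][data[i + 1]];
        ++lanes[2][data[i + 2]];
        ++lanes[3][data[i + 3]];
    }
    /* Leftover bytes when size is not a multiple of 4. */
    for (; i < size; i++) {
        ++lanes[0][data[i]];
    }

    /* Merging is a plain element-wise sum with no dependency between
       iterations, unlike the counting loops above. */
    for (size_t v = 0; v < 256; v++) {
        hist[v] = lanes[0][v] + lanes[1][v] + lanes[2][v] + lanes[3][v];
    }
}
```

Calling histogram_4way(data, size, A) should leave A with the same counts as the loop in the question. Only the final merge has fully independent iterations; the counting loops still depend on the values in data, which is the reason the original loop can't be simply vectorized.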