Search code examples
c++compiler-optimizationloop-unrolling

Is there a standard way to apply an expression to multiple elements in an array, like an unrolled loop?


I've been wondering if there is any standard in C++, that allows you to expand static expression, based on the arrays or some other predefined sets of data, to N rows of code with their internal data.

Example: if there is a need to sum up 5 adjacent matrix cells I would write a function like this:

int sumPlus(const vector<vector<int>>& matrix, int atX, int atY) {
    static const pair<int,int> positions[5] = {{0,0},{0,1},{0,1},{-1,0},{0,-1}};

    int sum = 0;
    for(int i = 0;i < 5;++i) {
        sum += matrix[atY + positions[i].first][atX + positions[i].second];
    }

    return sum;
}

but as I know, it will be translated to code with jumps (jmp or jle instructions) which is slow code. So the question itself: is there any standard that will provide the syntax, like this:

int sumPlus(const vector<vector<int>>& matrix, int atX, int atY) {
    expression_array pair<int,int> positions[5] = {{0,0},{0,1},{0,1},{-1,0},{0,-1}};

    int sum = 0;
    expression<positions : position>(sum += matrix[atY + position.first][atX + position.second]);

    return sum;
}

which will be expanded later into this by the compiler:

int sumPlus(const vector<vector<int>>& matrix, int atX, int atY) {
    int sum = 0;

    sum += matrix[atY + 0][atX + 0];
    sum += matrix[atY + 1][atX + 0];
    sum += matrix[atY + (-1)][atX + 0];
    sum += matrix[atY + 0][atX + 1];
    sum += matrix[atY + 0][atX + (-1)];

    return sum;
}

Solution

  • Loop unrolling compiler optimization

    Both versions produce identical assembly (when made equivalent), see https://godbolt.org/z/7Kf59r6o8

    The optimization that you're missing is called "loop unrolling". The compiler already fully or partially unrolls loops, meaning that code such as

    for(int i = 0; i < 5; ++i) { /* do stuff */; }
    

    ... is automatically translated into

    /* do stuff where i = 0 */;
    /* do stuff where i = 1 */;
    /* do stuff where i = 2 */;
    /* do stuff where i = 3 */;
    /* do stuff where i = 4 */;
    

    Of course, it only happens if the compiler deems this worth it, which depends on heuristics. Clang is somewhat more aggressive in loop unrolling than GCC. Just let the compiler figure it out, or, if you have done profiling and can know that the compiler makes a bad decision, you can improve it by forcing loop unrolling:

    #pragma clang loop unroll_count(2)
    

    Personally, I would have written

    // I have faith that this loop gets unrolled.
    for (int i = -1; i <= 1; ++i) {
        // I have faith that this loop gets unrolled too.
        for (int j = -1; j <= 1; ++j) {
            // I have faith that this if statement is optimized out.
            if (i == 0 || j == 0) {
                sum += matrix[atY + i][atX + j];
            }
        }
    }
    

    If you insist ...

    expression<positions : position>(sum += matrix[atY + position.first][atX + > position.second]);
    

    To get something along these lines, you could create some monstrosity made of std::index_sequence and fold expressions, but it's not worth it here. In the end, it would look something like

    int sum(const vector<vector<int>>& matrix, int atX, int atY) {
        static constexpr int positions[5][2] = {{0,0},{0,1},{1,0},{-1,0},{0,-1}};
    
        return [atX, atY]<std::size_t... I>(std::index_sequence<I...>) {
            return (matrix[atY + position[I][0]][atX + position[I][1]] + ... + 0);
        }(std::make_index_sequence<I...>{});
    }
    

    Anyhow, don't do this. It's a purely academic code sample.