Search code examples
c++loopsoptimizationloop-unrolling

How to implement base 2 loop unrolling at run-time for optimization purposes


Consider some code that needs to executed repeatedly anywhere between 1-1,000,000 times, and that the number of repeats is not known at compile time. It is my understanding that loop unrolling would be a negligible optimization given the large number of loops and would only be optimizing up to the max_unrolls specified at compile time. The idea that I have come up with is to implement a binary (or base 2) partial loop un-roller that essentially executes some_function repeatedly a number of times that is specified at run-time. I have come up with some code to demonstrate the idea, an condensed version is show below. The method used in the code below has a number of issues from a usability perspective.

  1. It requires the coder to manually copy out the base 2 unroll essentially copy out the unroll 2^n-1 times.
  2. This also needs to be re-done for every new function that needs to use this method.

My question is threefold. Firstly am I missing something, is the compiler already intelligent enough to do this itself? Secondly what is an efficient way of implementing this so that it becomes feasible to benchmark this against a standard for loop, whilst solving the above issues. Thirdly to your knowledge is there a library that has already implemented this.

Please note: I am doing this purely for the fun of it, but do not have the expertise to know if this will be effective. I have done testing on the code but have only found very small improvements, however I believe I have not been able to unroll manually far enough for a fair comparison to be made. Also I am aware that this method has the potential to create massive binary sizes, however I believe this would be a worthwhile time memory trade off. Also, if you post any assembly it will probably take me another year or so to understand it.

inline void some_reapeated_function(int &function_parameter_1, int &function_parameter_2)
{
    function_parameter_1 /= function_parameter_2;
}

// Function to be called when you need it to be unrolled.
int runtime_unroll(unsigned int &no_of_loops, int &function_parameter_1, int &function_parameter_2)
{
    std::vector<bool> binary_vector;

    // Stores the number of loops in a binary representation in a vector.
    binary_function.reserve(no_of_loops);
     while(no_of_loops) 
    {
        if (no_of_loops&1)
          binary_vector.push_back(false);
        else
          binary_vector.push_back(true);
        no_of_loops>>=1;
    } 

    // If binary of no_of_loops contains a 2^0 execute once.
    if (binary_vector[0])
    {
        some_reapeated_function(function_parameter_1,function_parameter_2);
    }
    // If binary of no_of_loops contains a 2^1 execute twice.
    if (binary_vector[1])
    {
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
    }
    //If binary of no_of_loops contains a 2^2 execute 4 times.
    if (binary_vector[2])
    {
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
    }


    /* This example only covers from 1 to 2^3-1 or up to 7 unrolls. 
    This can continue depending on the number of repetitions needed and 
    could incorporate a for loop to continue after the loop has fully unrolled */
}

Solution

  • You can easily implement something like this with C++ templates. Note that you are still at mercy of your compiler: there is no guarantee all the function calls would get inlined. If they are not, you can try using __forceinline keyword (or its equivalent).

    First of all, you need an unroller which takes a function as argument and executes it K times in a fully unrolled loop. The function call must be inlined, so you have to use functor objects instead of function pointers or std::function-s, and functor's type must be template. The unroller itself can be implemented as a recursive loop by integer template argument. Since functions in C++ cannot have partial template specialization, we have to make our unroller a template class. Here is sample code:

    // execute UnrollCnt times in unrolled fashion
    template<int UnrollCnt, class Functor> struct Unroller {
        static inline void Run(int base, const Functor &func) {
            func(base);
            Unroller<UnrollCnt - 1, Functor>::Run(base + 1, func);
        }
    };
    template<class Functor> struct Unroller<0, Functor> {
        static inline void Run(int base, const Functor &func) {
        }
    };
    

    Given the unroller, we can easily implement the unrolled loop. If we have N iterations, then we can call our unroller [N/K] times, then perform a few remaining calls as usual. Note that functor's type must still be template here. Here is the code:

    // execute with argument in range [begin, end)
    template<int UnrollCnt, class Functor>
    void UnrolledFor(int begin, int end, const Functor &func) {
        // iterations with unrolling
        int iter = begin;
        for (; iter <= end - UnrollCnt; iter += UnrollCnt)
            Unroller<UnrollCnt, Functor>::Run(iter, func);
        // last iterations without unrolling
        for (; iter < end; iter++)
            func(iter);
    }
    

    Now we can call the UnrolledFor loop for any function accepting a single argument, being the loop's iteration count. For example, we can compute sum of numbers from 0 to N-1:

    long long ans = 0;
    int main() {
        int cnt = 0;
        scanf("%d", &cnt);
        int start = clock();
        // note: passing a lambda function here, requesting 8x unrolling
        UnrolledFor<8>(0, cnt, [](int i) {
            ans += i;
        });
        int elapsed = clock() - start;
        printf("%lld   (%d pg)\n", ans, elapsed);
        return 0;
    }
    

    Note however, that manual unrolling may work much faster, because the thick level of abstraction here is not trivial to cut through for the compiler. For example, here are some timings I observe for the sample code (with N = 2000000000):

    With MSVC 2013 x64:
    1999999999000000000   (421 pg)   // 8x unrolling, ans is global
    1999999999000000000   (1389 pg)  // 1x unrolling, ans is global
    1999999999000000000   (4664 pg)  // 8x unrolling, ans is local
    1999999999000000000   (1388 pg)  // 1x unrolling, ans is local
    With MinGW GCC 5.1.0 x64:
    1999999999000000000   (1388 pg)  // 1x unrolling, ans is global
    1999999999000000000   (1404 pg)  // 8x unrolling, ans is global
    1999999999000000000   (1389 pg)  // 1x unrolling, ans is local
    1999999999000000000   (1393 pg)  // 8x unrolling, ans is local
    

    As you see, only MSVC with global ans variable did really win from unrolling. But with local ans variable (captured by reference) it got several times slower instead.

    So if you are really obsessed with performance, I suggest using macros for loop unrolling, they add absolutely no overhead.