Tags: c++, gcc, vector, iterator, max

Why is C++ std::max_element so slow?


I need to find the max element in a vector, so I'm using std::max_element. However, I've found that it's a very slow function, so I wrote my own version and managed to get 3x better performance. Here is the code:

#include <string>
#include <iostream>
#include <vector>
#include <algorithm>

#include <sys/time.h>

// Returns wall-clock time in seconds.
double getRealTime()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}

// Hand-rolled linear scan, keeping the running maximum by value.
// (The size parameter is unused; the loop walks the whole vector.)
inline int my_max_element(const std::vector<int> &vec, int size)
{
    auto it = vec.begin();
    int max = *it++;
    for (; it != vec.end(); it++)
    {
        if (*it > max)
        {
            max = *it;
        }
    }
    return max;
}

int main()
{
    const int size = 1 << 20;
    std::vector<int> vec;
    for (int i = 0; i < size; i++)
    {
        if (i == 59)
        {
            // plant the maximum near the front of the vector
            vec.push_back(1000000012);
        }
        else
        {
            vec.push_back(i);
        }
    }

    double startTime = getRealTime();
    int maxIter = *std::max_element(vec.begin(), vec.end());
    double stopTime = getRealTime();
    double totalIteratorTime = stopTime - startTime;

    startTime = getRealTime();
    int maxArray = my_max_element(vec, size);
    stopTime = getRealTime();
    double totalArrayTime = stopTime - startTime;

    std::cout << "MaxIter = " << maxIter << std::endl;
    std::cout << "MaxArray = " << maxArray << std::endl;
    std::cout << "Total CPU time iterator = " << totalIteratorTime << std::endl;
    std::cout << "Total CPU time array = " << totalArrayTime << std::endl;
    std::cout << "iter/array ratio: = " << totalIteratorTime / totalArrayTime << std::endl;
    return 0;
}

Output:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.000989199
Total CPU time array = 0.000293016
iter/array ratio: = 3.37592

On average, std::max_element takes 3x more time than my_max_element. So why am I able to write a much faster function than the standard library's so easily? Should I stop using std and write my own functions, since std is so slow?

Note: at first I thought it was because I was using an integer i in a for loop instead of an iterator, but that seems not to matter.

Compiler and flags:

g++ (GCC) 4.8.2

g++ -O3 -Wall -c -fmessage-length=0 -std=c++0x


Solution

  • Before voting on this answer, please test (and verify) it on your machine and comment with the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes, but only one person has posted results, and those results did not show the effect described below (though they were obtained with different test code; see the comments).


    There seems to be an optimizer bug/artifact. Compare the times of:

    // Original libstdc++ implementation (while-loop form).
    template<typename _ForwardIterator, typename _Compare>
    _ForwardIterator
    my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
                        _Compare __comp)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;

      while (++__first != __last)
        if (__comp(__result, __first))
          __result = __first;

      return __result;
    }

    // Transformed version: the first increment is hoisted out of the loop
    // condition into a plain for loop.
    template<typename _ForwardIterator, typename _Compare>
    _ForwardIterator
    my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
                           _Compare __comp)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;
      ++__first;

      for (; __first != __last; ++__first)
        if (__comp(__result, __first))
          __result = __first;

      return __result;
    }
    

    The first is the original libstdc++ implementation; the second is a transformation that should change neither behaviour nor requirements. Clang++ produces very similar run times for the two functions, whereas g++ 4.8.2 is four times faster with the second version.
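
    For reference, a minimal harness along these lines reproduces the comparison. This is my sketch, not the answer's original test code (which was only linked): the vector size, the ascending fill, and the std::chrono timing are all assumptions.

    // Hypothetical harness (my reconstruction, not the answer's test code).
    // Assumes my_max_element_orig and my_max_element_changed from above are
    // pasted in before main().
    #include <chrono>
    #include <iostream>
    #include <numeric>
    #include <vector>

    template<typename F>
    double time_ms(F f)
    {
      auto t0 = std::chrono::steady_clock::now();
      f();
      auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main()
    {
      std::vector<int> vec(1 << 26);            // large enough to dominate overhead
      std::iota(vec.begin(), vec.end(), 0);     // ascending data, as in the question

      // The internal-style comparator takes iterators, not values.
      auto iter_less = [](std::vector<int>::iterator a,
                          std::vector<int>::iterator b) { return *a < *b; };

      volatile int sink;                        // keeps the calls from being elided
      double t_orig    = time_ms([&] {
        sink = *my_max_element_orig(vec.begin(), vec.end(), iter_less); });
      double t_changed = time_ms([&] {
        sink = *my_max_element_changed(vec.begin(), vec.end(), iter_less); });

      std::cout << "orig:    " << t_orig    << " ms\n";
      std::cout << "changed: " << t_changed << " ms\n";
    }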


    Following Maxim's proposal, changing the vector's element type from int to int64_t makes the changed version only 1.7 times faster than the original, rather than 4 times (g++ 4.8.2).


    The difference is in predictive commoning of *result: the value of the current max element is kept in a register, so it does not have to be reloaded from memory on each iteration. This gives a far cleaner cache access pattern (rows are successive iterations; columns are the addresses touched):

    w/o commoning     with commoning
    *                 *
    **                 *
     **                 *
      **                 *
      * *                 *
      *  *                 *
      *   *                 *
    

    Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

    With the while loop (2.88743 ms):

        movq    %rdi, %rax          # result = first
        jmp .L49
    .L51:
        movl    (%rdi), %edx        # load *first
        cmpl    %edx, (%rax)        # reload *result and compare it with *first
        cmovl   %rdi, %rax          # if *result < *first: result = first
    .L49:
        addq    $4, %rdi            # ++first
        cmpq    %rsi, %rdi
        jne .L51                    # loop until first == last
    

    With the for loop (1235.55 μs):

        leaq    4(%rdi), %rdx       # rdx = first + 1
        movq    %rdi, %rax          # result = first
        cmpq    %rsi, %rdx
        je  .L53                    # single-element range: done
        movl    (%rdi), %ecx        # ecx caches *result (the current max)
    .L54:
        movl    (%rdx), %r8d        # load *first
        cmpl    %r8d, %ecx          # compare the cached max with *first
        cmovl   %rdx, %rax          # if max < *first: result = first
        cmovl   %r8d, %ecx          #                  and refresh the cached max
        addq    $4, %rdx            # ++first
        cmpq    %rdx, %rsi
        jne .L54                    # loop until first == last
    .L53:
    

    If I force commoning by explicitly storing *result into a variable prev (at the start, and again whenever result is updated) and comparing against prev instead of *result, I get an even faster loop (377.601 μs):

        movl    (%rdi), %ecx        # prev = *first
        movq    %rdi, %rax          # result = first
    .L57:
        addq    $4, %rdi            # ++first
        cmpq    %rsi, %rdi
        je  .L60                    # reached last: done
    .L59:
        movl    (%rdi), %edx        # load *first
        cmpl    %edx, %ecx          # compare prev with *first
        jge .L57                    # no new max: branch taken almost always
        movq    %rdi, %rax          # new max: result = first
        addq    $4, %rdi            # ++first
        movl    %edx, %ecx          # prev = *first
        cmpq    %rsi, %rdi
        jne .L59
    .L60:
    
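    In source form, the forced-commoning variant looks roughly like this. This is my reconstruction from the description above (the answer shows only the resulting asm), and it compares values directly rather than going through the iterator-based comparator of the earlier listings:

    // Sketch of forced commoning: keep a copy of the current max in __prev
    // and compare against that copy instead of reloading *__result.
    template<typename _ForwardIterator>
    _ForwardIterator
    my_max_element_prev(_ForwardIterator __first, _ForwardIterator __last)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;
      auto __prev = *__first;            // cached value of the current max

      while (++__first != __last)
        if (__prev < *__first)           // compare against the cached copy
          {
            __result = __first;
            __prev = *__first;           // refresh the cache on each update
          }

      return __result;
    }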

    The reason this is faster than the for loop is that the conditional moves (cmovl) in the for-loop version above are a pessimisation here: they add a data dependency on every iteration even though the new-maximum case they handle is rare (Linus says that cmov is only a good idea if the branch is unpredictable). For randomly distributed data the update branch is expected to be taken H_n times, where H_n is the n-th harmonic number; this is a negligible proportion, since H_n grows only logarithmically and H_n/n therefore rapidly approaches 0. For example, with n = 2^20 elements, H_n ≈ ln(2^20) + γ ≈ 14.4, i.e. roughly 14 updates in a million iterations. The conditional-move code will only be better on pathological data, e.g. [1, 0, 3, 2, 5, 4, ...].
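
    To make that last point concrete, data of that shape can be generated as follows (a hypothetical helper, not from the answer); on such input the "new max" branch is taken on every other element rather than ~H_n times, which is exactly where branches lose and cmov wins:

    // Hypothetical generator for the pathological pattern [1, 0, 3, 2, 5, 4, ...]:
    // every even index holds a new running maximum, so the "new max" branch is
    // taken about n/2 times instead of about H_n times.
    #include <vector>

    std::vector<int> make_pathological(int n)
    {
      std::vector<int> v(n);
      for (int i = 0; i + 1 < n; i += 2)
      {
        v[i]     = i + 1;   // 1, 3, 5, ...
        v[i + 1] = i;       // 0, 2, 4, ...
      }
      if (n % 2 != 0)
        v[n - 1] = n;       // odd tail: one final new maximum
      return v;
    }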