Why sorting an ascending array (1~100,000) with std::sort() is faster than just using for loop 100,000 times

I know that std::sort has a very high performance, as far as I know it uses Introsort (quickSort+insertionSort+heapSort), but in my tests I found that: "sorting an ascending array (1~99999) with std::sort() is faster than just using for loops 100,000 times". Although std::sort is fast, at least it needs to traverse the entire array. I think this is not possible (std::sort is faster than just for loops with the same number of loops and array lengths). I am very confused, who can tell me what is the principle.

It's hard to understand only in MacOS, I also tested it in Linux (Centos 7.6) and the results are expected.I want to know what Mac and Xcode did to it.

Environment：
1. MacBook Pro (MacOS Mojave 10.14.6), Xcode
2. X86_64（Centos7.6）, clang++

Test Code：

#include <iostream>
#include <sys/time.h>
#define LENGTH 100000
int *  order_arr(int lo, int hi, int reverse) {
    int *arr=(int *)malloc(hi<<2);
    if (reverse==0) {
        for (int i = lo; i < hi; ++i) {
            arr[i]=i;
        }
    return arr;
    }else{
        for (int i = lo; i < hi; ++i) {
            arr[i]=hi-1-i;
        }
        return arr;
    }
}

int main(int argc, const char * argv[]) {

    // ---- Create an ascending array: 0~99999
    int * order_array = order_arr(0, LENGTH, 0);
    //------------------------------------------------------------------
    timeval starttime,endtime;
    gettimeofday(&starttime,0);
    //----------------------------------------------------------------------start_time
    // ---- STL sort
//    std::sort(order_array, order_array+LENGTH);

    // ---- Only for loop 100000 times
//    for (int i = 0; i < LENGTH; ++i) ;
    //----------------------------------------------------------------------end_time
    gettimeofday(&endtime,0);
    double timeuse = 1000000*(endtime.tv_sec - starttime.tv_sec) + endtime.tv_usec - starttime.tv_usec;
    std::cout<< (timeuse/=1000000) <<std::endl;

    return 0;
}

Running results:
1. MacOS(Xcode):Unreasonable, with or without optimization, std::sort() sorts the array, this time should not be less than only for loop(without optimization 0.000203 s).
  - optimization: clang++ test.cpp -std=c++11 -o -O3 test
    1. for (int i=0; i<LENGTH; ++i) ; : 0 s
    2. std::sort(order_array, order_array+LENGTH);:0.000118 s
  - No optimization:clang++ test.cpp -std=c++11 -o test
    1. for (int i=0; i<LENGTH; ++i) ; : 0.000203 s
    2. std::sort(order_array, order_array+LENGTH);:0.000123 s
2. Centos7.6(g++):reasonable
  - optimization:clang++ test.cpp -std=c++11 -o -O3 test
    1. for (int i=0; i<LENGTH; ++i) ; :0 s
    2. std::sort(order_array, order_array+LENGTH);:0.001654 s
  - No optimization:clang++ test.cpp -std=c++11 -o -O3 test
    1. for (int i=0; i<LENGTH; ++i) ; :0.0002745 s
    2. std::sort(order_array, order_array+LENGTH);:0.002354 s

Solution

Here is a possible explanation:

You do not use the contents of the sorted array. clang expands the initialization and the template code inline and can determine that you are discarding the array, so it does not even generate the code to sort it, resulting in faster time than the alternative where it does not discard the explicit empty loop.

Try and make main() return the first element of the array to see if it makes a difference.

With your updated question, there seems to be no real issue:

the timings for optimized builds are consistent, with no time spent in the empty loop and a short time spent sorting the already sorted array.
the timings for the unoptimized builds are essentially irrelevant as the core of the template code might still be optimized while the simple loop is compiled into naive inefficient code.

You seem surprised by the performance of std::sort() on an already sorted array on MacOS. It is possible that sorting is very efficient there on an already sorted array, both in increasing and decreasing order. An initial scan is used to split the array in chunks. With your dataset, the initial scan quickly yields a single chunk that is left as is or is simply reversed.

You can try and analyze the template code, which is available on both platforms either directly in the include files or in the open source libraries.