Search code examples
c++cperformanceoptimizationmeasurement

Difficulties to measure C/C++ performance


I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to write it in a language that is common subset between C++ and C, that is standard-compliant for both languages and that is fairly portable. It was tested on different Windows PCs:

#include <stdio.h>
#include <time.h>

/// @return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
    return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}

int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern 
int const Pattern_0_0_2_2 = 0x40000002; 

/// @return - half of Billion  
int unpredictableIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // true, true, false, false ...
        if ( ( i & Pattern_0_0_2_2 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

/// @return - half of Billion  
int noIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // 1, 1, 0, 0 ...
        sum += ( i & Pattern_0_0_2_2 ) == 0;
    }
    return sum;
}

int main()
{
    clock_t volatile start;
    clock_t volatile stop;
    int volatile sum;
    printf( "Puzzling measurements:\n" );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = noIfs();
    stop = clock();
    printf( "Same without ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );
}

Compiled with VS2010; /O2 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000

Edit: Full switches of compiler:

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Other person posted such ... Compiled with MinGW, g++ 4.71, -O1 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000

Also he posted such results for -O3 optimizations:

Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000

Now I have question. What is going on here?

More specifically ... How can a fixed function take so different amounts of time? Is there something wrong in my code? Is there something tricky with Intel processor? Are the compilers doing something odd? Can it be because of 32 bit code ran on 64 bit processor?

Thanks for attention!

Edit: I accept that g++ -O1 just reuses returned values in 2 other calls. I also accept that g++ -O2 and g++ -O3 have defect that leaves the optimization out. Significant diversity of measured speeds (450% !!!) seems still mysterious.

I looked at disassembly of code produced by VS2010. It did inline unpredictableIfs 3 times. The inlined code was fairly similar; the loop was same. It did not inline noIfs. It did roll noIfs out a bit. It takes 4 steps in one iteration. noIfs calculate like was written while unpredictableIfs use jne to jump over increment.


Solution

  • With -O1, gcc-4.7.1 calls unpredictableIfs only once and resuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)

    With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.

    Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.

    With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

    .L12:
        movl    %eax, %ecx
        andl    $1073741826, %ecx
        cmpl    $1, %ecx
        adcl    $0, %edx
        addl    $1, %eax
        cmpl    $1000000000, %eax
        jne .L12
    

    and for noIfs it's similar:

    .L15:
        xorl    %ecx, %ecx
        testl   $1073741826, %eax
        sete    %cl
        addl    $1, %eax
        addl    %ecx, %edx
        cmpl    $1000000000, %eax
        jne .L15
    

    where it was

    .L7:
        testl   $1073741826, %edx
        sete    %cl
        movzbl  %cl, %ecx
        addl    %ecx, %eax
        addl    $1, %edx
        cmpl    $1000000000, %edx
        jne .L7
    

    with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.

    With -O3, the loop for unpredictableIfs becomes worse,

    .L14:
        leal    1(%rdx), %ecx
        testl   $1073741826, %eax
        cmove   %ecx, %edx
        addl    $1, %eax
        cmpl    $1000000000, %eax
        jne     .L14
    

    and for noIfs (including the setup-code here), it becomes better:

        pxor    %xmm2, %xmm2
        movq    %rax, 32(%rsp)
        movdqa  .LC3(%rip), %xmm6
        xorl    %eax, %eax
        movdqa  .LC2(%rip), %xmm1
        movdqa  %xmm2, %xmm3
        movdqa  .LC4(%rip), %xmm5
        movdqa  .LC5(%rip), %xmm4
        .p2align 4,,10
        .p2align 3
    .L18:
        movdqa  %xmm1, %xmm0
        addl    $1, %eax
        paddd   %xmm6, %xmm1
        cmpl    $250000000, %eax
        pand    %xmm5, %xmm0
        pcmpeqd %xmm3, %xmm0
        pand    %xmm4, %xmm0
        paddd   %xmm0, %xmm2
        jne .L18
    
    .LC2:
        .long   0
        .long   1
        .long   2
        .long   3
        .align 16
    .LC3:
        .long   4
        .long   4
        .long   4
        .long   4
        .align 16
    .LC4:
        .long   1073741826
        .long   1073741826
        .long   1073741826
        .long   1073741826
        .align 16
    .LC5:
        .long   1
        .long   1
        .long   1
        .long   1
    

    it computes four iterations at once, and accordingly, noIfs runs almost four times as fast then.