Search code examples
c++windowsperformanceperformancecounter

Do I need to prevent preemption while measuring performance


I want to measure the performance of block of code with the use of QueryPerformanceCounter in Windows. What I would like to know is whether between different runs I can do something to get equal measurements for the same data (I want to measure the performance of different sorting algorithms on different sizes of arrays containing pod or some custom objects). I know that the current process can be interrupted from execution because of interrupts or I/O operations. I'm not doing any I/O so it's only interrupts that may affect my measurement, I'm assuming that the kernel also has some time frame that allows my process to run, so I think that's gonna schedule away my proc as well.

How do people make accurate measurements through measuring the time of execution of a specific piece of code?


Solution

  • Time measurements are tricky because you need to find out why your algo is slower. That depends on the input data (e.g. presorted data see Why is it faster to process a sorted array than an unsorted array?) or the data set size (fits into L1,L2,L3 cache see http://igoro.com/archive/gallery-of-processor-cache-effects/).

    That can hugely influence your measured times. Also the order of measurements can play a critical role. If you execute the sort alogs in a loop and each of them allocates some memory the first test will most likely loose. Not because the algorithm is inferior but the first time you access newly allocated memory it will be soft faulted into your process working set. After the memory is freed the heap allocator will return pooled memory which will have an entirely different access performance. That becomes very noticeable if you sort larger (many MB) arrays.

    Below are the touch times of a 2 GB array from different threads for the first and second time printed. Each page (4KB) of memory is only touched once.

    Threads Size_MB Time_ms us/Page MB/s    Scenario
    1       2000    355     0.693   5634    Touch 1
    1       2000    11      0.021   N.a.    Touch 2
    2       2000    276     0.539   7246    Touch 1
    2       2000    12      0.023   N.a.    Touch 2
    3       2000    274     0.535   7299    Touch 1
    3       2000    13      0.025   N.a.    Touch 2
    4       2000    288     0.563   6944    Touch 1
    4       2000    11      0.021   N.a.    Touch 2
    
    // Touch is from the compiler point of view a nop operation with no observable side effect 
    // This is true from a pure data content point of view but performance wise there is a huge
    // difference. Turn optimizations off to prevent the compiler to outsmart us.
    #pragma optimize( "", off )
    void Program::Touch(void *p, size_t N)
    {
        char *pB = (char *)p;
        char tmp;
        for (size_t i = 0; i < N; i += 4096)
        {
            tmp = pB[i];
        }
    
    }
    #pragma optimize("", on)
    

    To truly judge the performance of an algorithm it is not sufficient to perform time measurements but you need a profiler (e.g. the Windows Performance Toolkit free, VTune from Intel (not free)) to ensure that you have measured the right thing and not something entirely different.