assembly, performancecounter, avx, icc

Why does the number of executed AVX instructions change with processor family?


I have implemented a simple AVX program using intrinsic functions, compiled with icc using -march=core-avx2 -O3. The program does not use multithreading.

While profiling the execution of the program, I measured the number of AVX operations actually executed (256-bit floating-point operations) with the PAPI library.

When I execute the program on different processors (i.e. Core i7 of the Sandy Bridge, Haswell and Skylake generations), the number of executed instructions is nearly identical for Sandy Bridge and Skylake but about 50% higher for Haswell.

As far as I understand, the generated assembly instructions do not differ between the architectures, since -march=native is not used.

Where does the difference between executed and written operations come from? Is there some sort of microcode emulation for some hardware or instructions? Or is there some architecture-specific overcount happening?


Solution

  • The problem should be divided into at least two questions: 1) are the counters supposed to be the same between two runs? 2) can the reported numbers be trusted? The first question addresses possible sources of variation in your methodology; the second addresses the specifics of the tools you use (PAPI and the underlying hardware counters it relies on).

    1. Run the same binary on every system. Not separate programs compiled from the same source, but the same binary file copied to each machine. If that brings the measured AVX instruction counts together, then the problem is that separate compilations generate different code.

    2. Simplify the program code so that the resulting number of instructions is trivial to guess. Use a loop with a hardcoded number of iterations, a straight-line code block inside it, and PAPI calls placed just around the loop. This way, you can predict the outcome and compare it with the reported numbers. You use intrinsics, so one might assume that compiler optimizations should not affect the code generated from them, but reduce the optimization level to -O0 anyway to make sure the compiler uses as few tricks as possible, such as dynamic processor dispatching.

      Make the program even simpler. Leave only a single AVX instruction outside any loop. What will PAPI report? Leave zero of them. Will PAPI's report still match the expectation?

    These techniques should be enough to logically deduce whether the problem lies in discrepancies between the build processes of independent binaries, divergence of the runtime paths chosen by a single binary on different hardware, incorrect PAPI usage, or plain PAPI bugs, possibly caused by the underlying hardware counting instructions unreliably. By the way, you didn't show any code, so it is quite possible that you forgot to initialize something, have a varying number of iterations, or made some similar omission in your approach.
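    The simplified test program from step 2 could look roughly like the sketch below. It is only an illustration, not the asker's code: the names expected_avx_ops, run_kernel and measure, the USE_PAPI build macro and the choice of the PAPI_VEC_DP preset are all assumptions made here. Substitute the event you actually measured, and check with papi_avail that it exists on each machine. The target("avx") attribute is GCC/Clang syntax that lets the file build without -mavx; it can be dropped when compiling with -march=core-avx2.

```c
#include <assert.h>
#include <immintrin.h>

#ifdef USE_PAPI                    /* hypothetical macro: build with -DUSE_PAPI and link -lpapi */
#include <papi.h>
#define EVENT_NAME "PAPI_VEC_DP"   /* assumption: replace with the event you measured */
#endif

enum { AVX_OPS_PER_ITER = 4 };

/* Number of 256-bit add instructions the kernel is expected to execute,
 * assuming the compiler emits exactly one instruction per intrinsic
 * (verify this in the disassembly). */
long long expected_avx_ops(long long iterations) {
    assert(iterations >= 0);
    return iterations * AVX_OPS_PER_ITER;
}

/* Straight-line block of four AVX additions per iteration.
 * Returns lane 0 of the accumulator, which equals 4 * iterations. */
__attribute__((target("avx")))
double run_kernel(long long iterations) {
    __m256d acc = _mm256_setzero_pd();
    const __m256d one = _mm256_set1_pd(1.0);
    for (long long i = 0; i < iterations; ++i) {
        acc = _mm256_add_pd(acc, one);
        acc = _mm256_add_pd(acc, one);
        acc = _mm256_add_pd(acc, one);
        acc = _mm256_add_pd(acc, one);
    }
    double out[4];
    _mm256_storeu_pd(out, acc);
    return out[0];
}

/* Runs the kernel with the counter started and stopped right around it.
 * Returns the raw counter value, or -1 if PAPI is not compiled in or a call failed. */
long long measure(long long iterations, double *result) {
    long long count = -1;
#ifdef USE_PAPI
    int event_set = PAPI_NULL;
    int ok = PAPI_library_init(PAPI_VER_CURRENT) == PAPI_VER_CURRENT
          && PAPI_create_eventset(&event_set) == PAPI_OK
          && PAPI_add_named_event(event_set, EVENT_NAME) == PAPI_OK
          && PAPI_start(event_set) == PAPI_OK;
#endif
    *result = run_kernel(iterations);
#ifdef USE_PAPI
    if (!ok || PAPI_stop(event_set, &count) != PAPI_OK)
        count = -1;
#endif
    return count;
}
```

    Calling measure(1000, &r) and comparing its return value with expected_avx_ops(1000) gives the predicted-versus-reported comparison; calling it with 0 iterations corresponds to the "leave zero of them" check. Keep in mind that, depending on how an event is defined for a given CPU, PAPI may report a scaled value (operations rather than instructions), so the interesting result is whether the ratio to the prediction stays the same across machines.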