Tags: c++, vectorization, simd, icc, auto-vectorization

Intel Auto-Vectorization Trip Count Explanation?


I've done quite a bit of thread-level and process-level parallelism, and now I'm trying to get into instruction-level parallelism with the Intel C++ Compiler, which is proving quite a challenge.

While doing some auto-vectorization of loops and analysing the compiler logs, I found an "Estimate of max trip count of loop" message that I can't quite figure out.

Example:

double a[100], x[100], y[100];
...
for (int i = 0; i < 100; i++) {
   a[i] = x[i] + y[i];
}

For this loop the compiler reports an "Estimate of max trip count" of 12 trips. I read somewhere that the vectorization process can handle a total of 8 elements per trip, as long as the cost of each cycle is less than 6 u-operations; from what I can tell, this example loop has a cost of 1 store, 2 reads, and 1 arithmetic operation.

So in theory, my trip count should be 100/8 = 12.5 trips, and therefore 13 trips.

Is this just rounding by the compiler, or is there some other optimization going on in the background that allows the process to take fewer than 13 trips?

One more question: is my assumption of 6 u-operations per cycle correct? Are there cases where it doesn't apply?

Thanks in advance


Solution

  • Rather than get caught up in how Intel implements each loop, let's try to answer your question about instruction-level parallelism.

    Your operation is limited by the reads and writes, so you can ignore the arithmetic when determining the number of cycles. Here is what Core2 through Broadwell can do:

    Core2:   two 16 byte reads and one 16 byte write per 2 clock cycles -> 24 bytes/clock cycle
    SB/IB:   two 32 byte reads and one 32 byte write per 2 clock cycles -> 48 bytes/clock cycle
    HSW/BDW: two 32 byte reads and one 32 byte write per clock cycle    -> 96 bytes/clock cycle
    

    The total number of bytes read and written is sizeof(double)*100*3 = 2400, so a quick estimate of the time it will take is:

    Core2:   2400/24 = 100 clock cycles
    SB/IB:   2400/48 =  50 clock cycles
    HSW/BDW: 2400/96 =  25 clock cycles
    

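    As a quick sanity check, here is a minimal sketch (the layout and names are my own, not from any compiler documentation) that reproduces the back-of-the-envelope arithmetic above:

    #include <cstdio>

    int main() {
        // 100 doubles, each touched by 2 loads (x, y) and 1 store (a)
        const double total_bytes = sizeof(double) * 100 * 3;    // 2400 bytes
        const double bytes_per_cycle[] = {24.0, 48.0, 96.0};    // Core2, SB/IB, HSW/BDW
        const char*  arch[] = {"Core2", "SB/IB", "HSW/BDW"};
        for (int i = 0; i < 3; i++)
            std::printf("%-8s %6.1f clock cycles\n", arch[i],
                        total_bytes / bytes_per_cycle[i]);
        return 0;
    }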
    Now the question is how to implement this for full bandwidth.

    For Core2 through Ivy Bridge one of the loads can be micro-fused with the addition, costing one micro-fused micro-op; the other load costs one micro-op, and the store another. If you want to do this every iteration you also need to update an index and do a conditional jump. Since Nehalem, that add/jump pair can macro-fuse, so the total number of micro-fused/macro-fused operations per iteration is (a concrete sketch follows the table):

                                Core2          Nehalem through Broadwell
    vector add + load               1          1
    vector load                     1          1
    vector store                    1          1
    scalar add                      1          ½
    conditional jump                1          ½  
    --------------------------------------------
    total                           5          4
    

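    To make those counts concrete, here is a minimal AVX intrinsics sketch of one possible inner loop (my own illustration, not what ICC actually emits; it assumes AVX is available and n is a multiple of 4, or the arrays are padded). The negative index counting up toward zero is what yields the single scalar add + conditional jump pair that can macro-fuse on Nehalem and later:

    #include <immintrin.h>

    void add_arrays(double* a, const double* x, const double* y, long n) {
        double* ae       = a + n;   // point one past the end of each array
        const double* xe = x + n;
        const double* ye = y + n;
        for (long i = -n; i < 0; i += 4) {        // scalar add + jump: can macro-fuse
            __m256d vx = _mm256_loadu_pd(xe + i); // standalone load: 1 micro-op
            __m256d s  = _mm256_add_pd(vx, _mm256_loadu_pd(ye + i)); // load folds into the add
            _mm256_storeu_pd(ae + i, s);          // vector store
        }
    }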
    For Core2 through Ivy Bridge either both loads need the same port, or a load and the store need the same port, so an iteration requires two clock cycles. For Haswell/Broadwell it's possible to do this every clock cycle. However, due to limitations on port 7, only statically allocated arrays can achieve this, using absolute 32-bit address + offset addressing (which, incidentally, is not possible on OSX). So for Haswell/Broadwell, if the array is not statically allocated, you either have to unroll the loop to do your operation every clock cycle (see the unrolled sketch after the summary), or it takes 1.5 clock cycles per iteration. Here is a summary of the clock cycles per iteration for each processor:

    Core2:   5 fused micro-ops/every two clock cycles
    SB/IB:   4 fused micro-ops/every two clock cycles
    HSW/BDW: 4 fused micro-ops/every clock cycle for statically allocated arrays
    HSW/BDW: 4 fused micro-ops/every 1.5 clock cycles for non-statically allocated arrays
    

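    And here is the unrolled-by-two variant mentioned above (again my own sketch, assuming n is a multiple of 8 or padded arrays): two vector stores now share a single index update and conditional jump, which is what lets Haswell/Broadwell average one store per clock even without the port 7 addressing trick:

    #include <immintrin.h>

    void add_arrays_unroll2(double* a, const double* x, const double* y, long n) {
        for (long i = 0; i < n; i += 8) {   // one add + jump per two vector stores
            __m256d s0 = _mm256_add_pd(_mm256_loadu_pd(x + i),
                                       _mm256_loadu_pd(y + i));
            __m256d s1 = _mm256_add_pd(_mm256_loadu_pd(x + i + 4),
                                       _mm256_loadu_pd(y + i + 4));
            _mm256_storeu_pd(a + i,     s0);
            _mm256_storeu_pd(a + i + 4, s1);
        }
    }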
    If you use stack-allocated arrays you can probably safely read past the end of the buffer; otherwise you should pad your arrays to the SIMD width (a sketch of the padding follows the counts below). The number of iterations of the loop is then:

    SSE2: 100/2 = 50
    AVX:  100/4 = 25
    

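    The padding itself might look like this (a sketch; round_up is a hypothetical helper, not a compiler facility): round the element count up to the widest vector step the loop will take, so the final iteration stays inside the buffer:

    #include <cstddef>

    // Round n up to the next multiple of the SIMD step w.
    constexpr std::size_t round_up(std::size_t n, std::size_t w) {
        return (n + w - 1) / w * w;
    }

    // 100 doubles padded for an 8-elements-per-iteration (unrolled AVX) loop.
    alignas(32) double a[round_up(100, 8)];   // 104 elements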
    The Intel compiler in my experience unrolls twice, which halves the number of iterations. Unrolled twice, the number of iterations is:

    SSE2: 100/4 = 25
    AVX:  100/8 = 12.5 -> 13
    

    Finally, in terms of clock cycles it's

    Core2:     50*2   = 100  clock cycles
    SB/IB:     25*2   =  50  clock cycles
    HSW/BDW:   25*1.5 =  37.5 clock cycles for non-statically allocated arrays, no unroll
    HSW/BDW:   25*1   =  25  clock cycles for statically allocated arrays, no unroll
    HSW/BDW:   25*1   =  25  clock cycles with full unrolling