
Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell


I have code that looks like this (a simple load, modify, store loop, simplified here for readability):

__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
  __m128i in = _mm_loadu_si128(inptr);
  __m128i out = in; // real code does more than this, but I've simplified it
  _mm_stream_si128(outptr,out);
  inptr  += 12;
  outptr += 16;
}

This code runs about 5 times faster on our older Haswell hardware than on our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.

We upgraded to the latest microcode on the Skylake, and also added vzeroupper instructions to avoid any AVX issues. Neither change had any effect.

outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses. (I put in checks to verify this.) inptr is unaligned by design. Commenting out the loads has no effect; the stores are the limiting instructions. outptr and inptr point to different memory regions with no overlap.

If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.

So the two questions are

1) why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?

2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?

I'm a newbie when it comes to intrinsics.


Addendum - test case

Here is the entire test case: https://godbolt.org/z/toM2lB

Here is a summary of the benchmarks I took on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).

// icpc -std=c++14  -msse4.2 -O3 -DNDEBUG ../mre.cpp  -o mre
// The following benchmark times were observed on a Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
//    perf stat ./mre 100000
//
//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     1.65   7.29
//   _mm_storeu_si128     0.41   0.40

The ratio of stream to store is 4x or 18x, respectively.

I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.

Second edit - 64B aligned output

The comment from @Mystical made me check that the outputs were all cache aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).

So I changed my test code like this:

#if 0
    std::vector<Tile> tiles(outputPixels/32);
#else
    std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif

and now the numbers are quite different:

//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     0.19   0.48
//   _mm_storeu_si128     0.25   0.52

So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.

Third edit - purposeful misalignment

I tried the test suggested by @HaidBrais. I purposely allocated my vector's storage aligned to 64 bytes, then added 16 or 32 bytes inside the allocator so that the allocation was 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.

perf stat ./mre1  1000000

To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
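The convention above (an alignment of 2^N excludes 2^(N+1) and higher) amounts to taking the lowest set bit of the address. A minimal sketch of such a check follows; the helper names `lowest_set_bit` and `exact_alignment` are hypothetical and not part of the original test case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Isolate the lowest set bit of an address: the largest power of two that
// divides it. Returns 0 for address 0.
constexpr std::uintptr_t lowest_set_bit(std::uintptr_t addr) {
    return addr & (~addr + 1);
}

// "Exact" alignment of a pointer in the sense used above: 16 means
// 16-byte aligned but NOT 32- or 64-byte aligned.
inline std::size_t exact_alignment(const void* p) {
    return static_cast<std::size_t>(
        lowest_set_bit(reinterpret_cast<std::uintptr_t>(p)));
}

// Usage sketch on a buffer that is known to start on a 64-byte boundary.
inline void exact_alignment_example() {
    alignas(64) static unsigned char buf[128];
    assert(exact_alignment(buf + 16) == 16);
    assert(exact_alignment(buf + 32) == 32);
    assert(exact_alignment(buf + 48) == 16);
    assert(exact_alignment(buf) >= 64);  // the buffer may be aligned even more strictly
}
```

Calling `exact_alignment(tiles.data())` before a run would confirm which row of the table below a given configuration belongs to.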

//   STORER               alignment time (seconds)
//                        byte  E5-2680   8180
// ---------------------------------------------------
//   _mm_storeu_si128     16       3.15   2.69
//   _mm_storeu_si128     32       3.16   2.60
//   _mm_storeu_si128     64       1.72   1.71
//   _mm_stream_si128     16      14.31  72.14 
//   _mm_stream_si128     32      14.44  72.09 
//   _mm_stream_si128     64       1.43   3.38

So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.

For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):

template <class T>
struct Mallocator {
  typedef T value_type;

  Mallocator() = default;
  template <class U>
  constexpr Mallocator(const Mallocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
    // One extra element leaves room for the 32-byte shift below
    // (this assumes sizeof(T) >= 32).
    uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n + 1) * sizeof(T)));
    if (!p1) throw std::bad_alloc();
    p1 += 32;  // misalign on purpose
    return reinterpret_cast<T*>(p1);
  }

  void deallocate(T* p, std::size_t) noexcept {
    uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
    p1 -= 32;  // undo the shift so free() receives the original pointer
    std::free(p1);
  }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }

...

std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
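The allocator above hard-codes the 32-byte shift. One way to templatize it is sketched below. This is not the code used for the measurements: the name `OffsetAllocator` and its `Offset` parameter are hypothetical, and the sketch assumes a platform that provides C11 `aligned_alloc` (as the original does). It also rounds the request up to a multiple of 64, since some implementations require the size to be a multiple of the alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // aligned_alloc, free (C11; availability is an assumption)
#include <vector>

// Hypothetical allocator: storage starts Offset bytes past a 64-byte boundary.
template <class T, std::size_t Offset>
struct OffsetAllocator {
    static_assert(Offset < 64, "Offset must be smaller than one cache line");
    typedef T value_type;

    // Offset is a non-type parameter, so allocator_traits cannot work out
    // how to rebind this allocator by itself; spell it out.
    template <class U>
    struct rebind { typedef OffsetAllocator<U, Offset> other; };

    OffsetAllocator() = default;
    template <class U>
    constexpr OffsetAllocator(const OffsetAllocator<U, Offset>&) noexcept {}

    T* allocate(std::size_t n) {
        static_assert(Offset % alignof(T) == 0, "Offset would under-align T");
        if (n > (std::size_t(-1) - 128) / sizeof(T)) throw std::bad_alloc();
        // Payload plus shift, rounded up to a non-zero multiple of 64.
        const std::size_t bytes = (n * sizeof(T) + Offset) / 64 * 64 + 64;
        unsigned char* base =
            static_cast<unsigned char*>(::aligned_alloc(64, bytes));
        if (!base) throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(base) % 64 == 0);
        return reinterpret_cast<T*>(base + Offset);  // misalign on purpose
    }

    void deallocate(T* p, std::size_t) noexcept {
        // Undo the shift so free() receives the pointer aligned_alloc returned.
        ::free(reinterpret_cast<unsigned char*>(p) - Offset);
    }
};

template <class T, class U, std::size_t Offset>
bool operator==(const OffsetAllocator<T, Offset>&,
                const OffsetAllocator<U, Offset>&) { return true; }
template <class T, class U, std::size_t Offset>
bool operator!=(const OffsetAllocator<T, Offset>&,
                const OffsetAllocator<U, Offset>&) { return false; }
```

With this, the three alignment cases in the table would be selected by type, for example `std::vector<Tile, OffsetAllocator<Tile, 16>> tiles(outputPixels/32);`, with 0 giving the 64-byte aligned case.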

Solution

  • The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.

    The actual loop from your godbolt code is:

    while (count > 0)
    {
        // std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
        __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
        __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
        __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
        __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

        __m128i tileVal0 = value0;
        __m128i tileVal1 = value1;
        __m128i tileVal2 = value2;
        __m128i tileVal3 = value3;

        STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
        STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
        STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
        STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

        ptr    += diffBytes * 4;
        count  -= diffBytes * 4;
        tile   += diffPixels * 4;
        ipixel += diffPixels * 4;
        if (ipixel == 32)
        {
            // go to next tile
            ipixel = 0;
            tileIter++;
            tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
        }
    }
    

    Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.

    On to the performance differences: streaming stores perform very differently on different hardware. These stores always occupy a line fill buffer (LFB) for some time, but how long varies. On many client chips a streaming store seems to occupy a buffer only for about the L3 latency: once the store reaches the L3 it can be handed off (the L3 tracks the rest of the work) and the LFB on the core can be freed. Server chips often have much longer latency, especially multi-socket hosts.

    Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
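To make the partial-line point concrete, here is a small sketch that counts how many cache lines a write overlaps, assuming 64-byte lines. The names `kCacheLineBytes`, `lines_touched` and `covers_full_lines` are hypothetical helpers for illustration; the addresses in the example are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size for the Haswell and Skylake parts discussed above.
constexpr std::size_t kCacheLineBytes = 64;

// Number of cache lines overlapped by a write of `len` bytes at address `addr`.
constexpr std::size_t lines_touched(std::uintptr_t addr, std::size_t len) {
    return len == 0
               ? 0
               : (addr + len - 1) / kCacheLineBytes - addr / kCacheLineBytes + 1;
}

// True when the write consists of whole cache lines only, which is the
// pattern streaming stores are meant for.
constexpr bool covers_full_lines(std::uintptr_t addr, std::size_t len) {
    return len != 0 && addr % kCacheLineBytes == 0 && len % kCacheLineBytes == 0;
}

// Usage sketch: the 64 bytes written per tile, at three starting offsets.
inline void lines_touched_example() {
    assert(lines_touched(0x1000, 64) == 1);  // 64-byte aligned: one full line
    assert(lines_touched(0x1010, 64) == 2);  // 16-byte aligned: parts of two lines
    assert(lines_touched(0x1020, 64) == 2);  // 32-byte aligned: parts of two lines
    assert(covers_full_lines(0x1000, 64));
    assert(!covers_full_lines(0x1010, 64));
}
```

The 16- and 32-byte aligned cases both split each tile's 64 bytes across two lines, which fits the nearly identical timings those two rows show in the question's table.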