Tags: simd, sse, avx

Handling data too narrow for the SIMD loop?


What is the best way to handle the leftover part of a row of data that's too small to fill the registers?

Consider an AVX-512 loop working on 32-bit pixel data:

fnAVX512(npixels) {
    while (npixels >= 16) {process_row; npixels -= 16;}
}

When this finishes, npixels might not be zero. Or the function might have been called with narrow data in the first place.

I can think of three possible solutions:

  1. Wrangle the data into the registers anyway by performing the load with a mask that excludes data outside of current interest. This might get messy, especially if the initial data wasn't wide enough to begin with: you can't backtrack the pointer to ensure you're in a "safe zone". Does, for example, _mm256_maskz_loadu_epiN guarantee not reading from memory locations that correspond to 0 mask bits, or could it cause an exception even if the mask is set up safely?

  2. Fall back to scalar without regard for how many pixels are left; in this case anywhere from 1 to 15? This is easy, but feels wrong; the scalar code looks bad in comparison and performs accordingly. (In my case it's around 10% of the speed of AVX.)

  3. If you already have 128-bit and 256-bit vector code to support older architectures, you could fall through those:

fnAVX512(npixels) {
    while (npixels >= 16) {process_row; npixels -= 16;}
    fnAVX2(npixels);
}

fnAVX2(npixels) {
    while (npixels >= 8) {process_row; npixels -= 8;}
    fnSSE(npixels);
}
...

And ultimately, with 3 or fewer pixels, do scalar. On the topic of trying to avoid that (and maybe this should be a separate question): how is MMX in 2023? Even though it doesn't have an extensive set of instructions, it should(?) be useful for certain types of data, such as 8-bit color channels packed into 32-bit integers, where scalar code needs a large number of operations.


Solution

  • If your problem allows it (pure vertical SIMD so reprocessing the same element twice is ok), a final vector that ends at the end of the array is good. It might or might not overlap with earlier data. If you're copying the result to a separate destination, this works very well, as long as your inputs are always wider than one vector. Otherwise it can take some care to get right and efficient when operating in-place.
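    For example, here's a minimal sketch of that "final overlapping vector" idea for an out-of-place operation, assuming at least one full vector of input and using a trivial add as a stand-in for the real per-pixel work:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Placeholder for the actual per-vector work.
static __m512i process(__m512i v) {
    return _mm512_add_epi32(v, _mm512_set1_epi32(1));
}

// Assumes npixels >= 16 so the final unaligned vector is entirely in-bounds.
void fn_avx512(uint32_t *dst, const uint32_t *src, size_t npixels)
{
    size_t i = 0;
    for (; i + 16 <= npixels; i += 16)
        _mm512_storeu_si512(dst + i, process(_mm512_loadu_si512(src + i)));

    if (i < npixels) {
        // Final vector ends exactly at the end of the array; it may overlap
        // pixels that were already processed, which is fine for pure vertical
        // work writing to a separate destination.
        size_t last = npixels - 16;
        _mm512_storeu_si512(dst + last, process(_mm512_loadu_si512(src + last)));
    }
}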


    You can use 8-byte or 4-byte loads / stores with XMM registers as part of a cleanup strategy, like SSE2 movq xmm, [rdi]. No need to involve MMX and have to run a slow emms instruction!

    The recent intrinsics like _mm_storeu_si64 and _mm_loadu_si32 with a void* operand are cleaner wrappers than earlier intrinsics for the same instructions. _mm_loadl_epi64(__m128i *) exists for movq, but there's no older intrinsic for movd. (Perhaps back in the bad old days, Intel thought you should use _mm_cvtsi32_si128 after dereferencing an int*, because their compiler (ICC) and MSVC don't care about strict-aliasing, and maybe not about alignment UB either?)

    Beware that some GCC versions have a broken definition for _mm_loadu_si32, shuffling the data to the high 32 bits after loading. See _mm_loadu_si32 not recognized by GCC on Ubuntu - fortunately the strict-aliasing and alignment UB bugs were fixed along with the more obvious bug, so there aren't any GCC versions with a silently-unsafe definition that appears to work.
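    As a concrete sketch (assuming 32-bit pixels and an SSE2 add standing in for the real work), the 0 to 3 leftover pixels after a 128-bit loop can be handled with an 8-byte and/or 4-byte load/store:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Handle nleft < 4 remaining 32-bit pixels with narrow XMM loads/stores.
void cleanup_tail(uint32_t *px, size_t nleft)
{
    const __m128i one = _mm_set1_epi32(1);       // placeholder per-pixel work
    if (nleft >= 2) {
        __m128i v = _mm_loadu_si64(px);          // movq: 8-byte load
        _mm_storeu_si64(px, _mm_add_epi32(v, one));
        px += 2;
        nleft -= 2;
    }
    if (nleft) {
        __m128i v = _mm_loadu_si32(px);          // movd: 4-byte load
        _mm_storeu_si32(px, _mm_add_epi32(v, one));
    }
}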


    SIMD masked loads and stores like AVX-512 _mm256_maskz_loadu_epiN do suppress faults from masked elements in unmapped pages. With only AVX2, masking granularity is limited to 32-bit or 64-bit elements, and you need a vector mask. (The exception is SSE2 _mm_maskmoveu_si128, where it's implementation-dependent whether faults are suppressed. It's also an NT store, bypassing (and evicting from) cache, and partial-line NT stores suck. It's generally not useful on modern CPUs because it's slow even in the best case, and it's only available as a store.)
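    For instance, a sketch of an AVX-512 masked cleanup for the 1 to 15 leftover 32-bit pixels (again with a dummy add standing in for the real work):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// nleft < 16: build a mask with the low nleft bits set and let the hardware
// suppress any faults from the masked-off lanes.
void masked_tail(uint32_t *px, size_t nleft)
{
    __mmask16 m = (__mmask16)((1u << nleft) - 1);
    __m512i v = _mm512_maskz_loadu_epi32(m, px);        // masked lanes load as 0, no fault
    v = _mm512_add_epi32(v, _mm512_set1_epi32(1));      // placeholder work
    _mm512_mask_storeu_epi32(px, m, v);                 // only the active lanes are stored
}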

    If you have control of how you allocate your buffers, you can round up allocation sizes to a multiple of the vector width to make things easier.
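    Something along these lines, assuming 4-byte pixels and 64-byte (512-bit) vectors; aligned_alloc is C11 and wants a size that's a multiple of the alignment:

#include <stdlib.h>
#include <stdint.h>

uint32_t *alloc_pixels(size_t npixels)
{
    size_t padded = (npixels + 15) & ~(size_t)15;            // round up to 16 pixels = 64 bytes
    return aligned_alloc(64, padded * sizeof(uint32_t));     // size is a multiple of 64
}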

    For images, it might not matter if you load a vector that has some pixels from the end of one row, some from the start of the next. But if you do need to do something different for each row, it can make some sense to pad the storage geometry so the stride between rows is a multiple of the vector width. i.e. have some padding pixels at the end of each row, if the actual image width isn't a multiple of 4 pixels / 16 bytes. With wide vectors and unrolled loops, this could waste a lot of cache footprint, so maybe only pad up to a multiple of 16 bytes, and have your loop handle odd sizes down to a multiple of 16 bytes but not narrower.
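    A small sketch of that padded-stride idea, assuming 32-bit pixels and padding each row only up to a 16-byte boundary (the function name here is made up):

#include <stddef.h>
#include <stdint.h>

// Stride in pixels between the start of one row and the next, padded so each
// row occupies a whole number of 16-byte chunks.
static size_t padded_stride(size_t width_pixels)
{
    return (width_pixels + 3) & ~(size_t)3;     // next multiple of 4 x 32-bit pixels
}

// Example: a 30-pixel-wide image gets a stride of 32 pixels (128 bytes), so a
// 16-byte loop can finish each row without touching the start of the next one.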
