Tags: c++, simd, avx512

SIMD find first element greater than X


I'm learning to use SIMD in C++, and this is my attempt at implementing a SIMD version of "find the first element greater than or equal to X".

My questions:

  • Can the reinterpret_cast be replaced with some intrinsic?
  • Is it better to make sure the pointer passed to reinterpret_cast is aligned, by handling the start the same way as the end: using non-SIMD logic for elements up to the first aligned short?
#include <immintrin.h> // AVX-512 intrinsics
#include <intrin.h>    // _BitScanForward (MSVC)
#include <vector>

// reference function of desired behavior
inline auto find_first_greater_or_equal_simple(const std::vector<short>& v, short insert)
{
    for (auto it = v.begin(), end = v.end(); it != end; ++it)
    {
        if (*it >= insert) return it;
    }
    return v.end();
}

// simd version
inline auto find_first_greater_or_equal_simd(const std::vector<short>& v, short insert)
{
    const __m512i target = _mm512_set1_epi16(insert);

    auto it = v.begin();
    const auto end = v.end();
    const auto end_simd = it + 32 * ((end - it) / 32);

    __mmask32 cmpge_mask{};

    for (; it != end_simd &&
        !(cmpge_mask = _mm512_cmpge_epi16_mask(*reinterpret_cast<const __m512i*>(&*it), target));
        it += 32)
    {}

    if (cmpge_mask)
    {
        unsigned long local_idx;
        _BitScanForward(&local_idx, cmpge_mask); // todo: __builtin_ctz 
        return it + local_idx;
    }

    for (; it != end; ++it)
    {
        if (*it >= insert) return it;
    }

    return v.end();
}

Solution

  • Normally I just use pointer math with SIMD intrinsics, or &v[i], rather than .begin() / .end() iterators. SIMD usage depends on contiguous storage of elements, so we're not gaining any generality over collections whose iterators aren't equivalent to a const short*.
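    For example, a minimal sketch of the same search framed with raw pointers (the function name is mine; the scalar body stands in for where the SIMD loop would go):

    #include <vector>

    inline auto find_first_ge_ptr(const std::vector<short>& v, short insert)
    {
        const short* p   = v.data();            // contiguous storage guaranteed
        const short* end = v.data() + v.size();
        for (; p != end; ++p)                   // SIMD loop would go here
            if (*p >= insert) break;
        return v.begin() + (p - v.data());      // hand an iterator back to the caller
    }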

    Your reinterpret_cast is equivalent to _mm512_load_si512(&*it), which is the alignment-required version. (In optimized builds, the compiler will fold the load into a memory source operand for vpcmpw, which doesn't enforce alignment.) If your pointer might not be aligned, use _mm512_loadu_si512.
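    For example, a sketch of the question's loop with the cast replaced by the unaligned-load intrinsic:

    for (; it != end_simd &&
        !(cmpge_mask = _mm512_cmpge_epi16_mask(_mm512_loadu_si512(&*it), target));
        it += 32)
    {}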

    For vector widths narrower than 512 bits, the __m128i / __m256i load/store intrinsics have less convenient definitions that don't take void* args, so you need _mm256_loadu_si256( (const __m256i*) ptr_expression ). Intel changed over to void* for intrinsics introduced since about 2015, which includes everything new with AVX-512, but didn't retroactively change the older intrinsics, so we still need these noisy casts all over the place when working with integer vectors.


    Yes, on Intel CPUs especially, it's a good idea to align your pointers when using 512-bit vectors. Ideally you can just use an aligned allocator for your std::vector so the data is always aligned (assuming you only ever search from the start of the std::vector, not from some starting point in the middle). Or skip the alignment work if your data is already aligned often enough that the extra startup overhead isn't worth it.
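    A minimal sketch of such an allocator, using C++17 aligned operator new (the name simd_allocator is mine, not a standard or library name):

    #include <cstddef>
    #include <new>
    #include <vector>

    template <class T, std::size_t Align = 64>
    struct simd_allocator
    {
        using value_type = T;
        template <class U> struct rebind { using other = simd_allocator<U, Align>; };

        simd_allocator() = default;
        template <class U> simd_allocator(const simd_allocator<U, Align>&) noexcept {}

        T* allocate(std::size_t n)
        {
            return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Align}));
        }
        void deallocate(T* p, std::size_t) noexcept
        {
            ::operator delete(p, std::align_val_t{Align});
        }
    };

    template <class T, class U, std::size_t A>
    bool operator==(const simd_allocator<T, A>&, const simd_allocator<U, A>&) { return true; }
    template <class T, class U, std::size_t A>
    bool operator!=(const simd_allocator<T, A>&, const simd_allocator<U, A>&) { return false; }

    // v.data() is now always 64-byte aligned, so _mm512_load_si512 is safe
    // from the start of the vector:
    using aligned_vector = std::vector<short, simd_allocator<short>>;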

    But in this case, you can handle possible misalignment very cheaply: check the first vector, then align the pointer. The first aligned vector will partially overlap the first vector if it wasn't aligned, but that's fine; you don't need to avoid checking the same element twice since you're not summing or anything. This trick also works for copy-and-modify loops like dst[i] = f(src[i]) where you never call it with dst==src to operate in place.
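    A minimal sketch of that trick, assuming the range holds at least one full vector (32 shorts); the function name and pointer-based interface are mine, not from the question:

    #include <immintrin.h>
    #include <cstdint>

    inline const short* find_first_ge_overlap(const short* p, const short* end, short insert)
    {
        const __m512i target = _mm512_set1_epi16(insert);

        // Check the first vector unaligned; safe because end - p >= 32.
        __mmask32 m = _mm512_cmpge_epi16_mask(_mm512_loadu_si512(p), target);
        if (m) return p + _tzcnt_u32(m);

        // Round p up to the next 64-byte boundary. If p wasn't aligned, this
        // partially overlaps the first vector; re-checking elements is harmless.
        const short* a = reinterpret_cast<const short*>(
            (reinterpret_cast<std::uintptr_t>(p) + 64) & ~std::uintptr_t{63});

        // Aligned main loop over whole vectors.
        for (; a + 32 <= end; a += 32)
        {
            m = _mm512_cmpge_epi16_mask(_mm512_load_si512(a), target);
            if (m) return a + _tzcnt_u32(m);
        }

        // Scalar cleanup for a final partial vector.
        for (; a != end; ++a)
            if (*a >= insert) return a;
        return end;
    }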

    The same trick can be used for handling the end of the array, if the total array size is at least 1 vector large. If small arrays are not rare in your intended use-case, consider 128-bit or 256-bit vectors for the cleanup, which lets this trick work on smaller inputs. Or pad or align your arrays so it's safe to read past the end, masking away potential matches (set mask bits) from past where the array was supposed to end. (bzhi is good for this: _bzhi_u32.)
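    A sketch of that masked-tail idea. It assumes reading a full 64-byte vector starting at tail won't fault (e.g. the allocation is padded or aligned), with remaining valid elements, 0 < remaining < 32; the variable names are mine for illustration:

    __mmask32 m = _mm512_cmpge_epi16_mask(_mm512_loadu_si512(tail), target);
    m = _bzhi_u32(m, remaining);          // zero the mask bits past the real end
    if (m) return tail + _tzcnt_u32(m);   // match within the valid elements
    return end;                           // no match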

    Given AVX-512, you definitely have BMI1/BMI2, so use _tzcnt_u32 instead of compiler-specific intrinsics for BSF. Or use C++20 std::countr_zero from <bit>.
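    For example, the mask-to-index step from the question could become (a sketch, with cmpge_mask and it as in the question's code):

    #include <bit>   // C++20, std::countr_zero

    if (cmpge_mask)
    {
        return it + _tzcnt_u32(cmpge_mask);         // BMI1; works on MSVC/GCC/Clang
        // or, in portable standard C++20:
        // return it + std::countr_zero(cmpge_mask);
    }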