SSE comparisons not working as intended while doing a linear search over an array of integers in c++

I have the following code intended to execute a linear search through an array using streaming SIMD extensions in c++:

#include <iostream>
#include <emmintrin.h>

using namespace std;

bool sse2_search_array(int* arr, int size, int key) {
    int iterations;
    if (size % 16 == 0) {
        iterations = size / 16;
    }
    else {
        iterations = size / 16 + 1;
    }
    __m128i* arr_ = reinterpret_cast<__m128i*>(arr);  /*Cast to corresponding int type for 128 bit registers. Each __m128i
                                occupies 8 bits, so 16 integers can be processed simultaneously.*/
    __declspec(align(16)) int key_arr[16];
    fill_n(key_arr, 16, key);  /*fill key array with 16 keys (for SSE comparisons)*/
    __m128i* key_arr_ = reinterpret_cast<__m128i*>(key_arr);

    int result;
    /*Actual search begins here.*/
    for (int i = 0; i < iterations; i++, arr_++) {
        result = _mm_movemask_epi8(_mm_cmpeq_epi8( *key_arr_, *arr_));  /*Comparison of 2 16 bit arrays simultaneously.*/
        cout << "result: " << result << endl;
        if (result != 0) { return true; }
    }
    return false;

}

int main() {
    __declspec(align(16)) int example_array[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6 };

    cout << "found: " << sse2_search_array(example_array, 16, 128);
    cin.get();
}

It works but the example in the main function should return false since 128 is not in example_array, but sse2_search_arrayseems to always return true, and the value of result in the example is 1110111011101110b or 61166 and that does not make sense to me because I'm expecting it to be 0. So can somebody tell me what the problem is and how I can fix it? I am not very experienced with c++ and know very little of SSE.

Solution

Two major problems:

Never fill a scratch array just so you can load it as a vector:

__declspec(align(16)) int key_arr[16];
fill_n(key_arr, 16, key);  /*fill key array with 16 keys (for SSE comparisons)*/
__m128i* key_arr_ = reinterpret_cast<__m128i*>(key_arr);

Instead, use __m128i keyvec = _mm_set1_epi8(key);. There are much faster ways to broadcast a byte to all positions of a vector than doing 16 scalar stores to memory and then a vector load (which will suffer from a store-fowarding stall). Let the compiler pick for you by using _mm_set intrinsics instead of writing to local arrays.

int is 4 bytes (on all modern x86 compilers), but you apparently want to work with arrays of single-byte elements, since you're using _mm_cmpeq_epi8. Your example_array is actually 16 * 4 bytes long:

__declspec(align(16)) int example_array[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6 };
// equivalent byte array (on little-endian x86):
uint8_t byte_array[16*sizeof(int)] = { 1,0,0,0,  2,0,0,0,  3,0,0,0, ... };

Your comments are often completely wrong, e.g. Comparison of 2 16 bit arrays simultaneously. Perhaps you meant "byte"?

If you actually wanted to search arrays of int, use _mm_set1_epi32(key) and _mm_cmpeq_epi32. A 16 byte vector holds four ints. The movemask result is still based on bytes, but each group of 4 bits in the result will be the same.

See also the sse tag wiki and the x86 tag wiki for useful links. The c++ tag wiki has a lot of good stuff for the language in general, since you said you're new to that, too.

IDK why you're getting hits for key=128; that doesn't seem to make sense unless there's even more wrong with your code that I didn't notice.

Your debugger should be able to show you what's in your __m128i variables. Storing some temporaries in variables will make it easier to look at them with a C++ source-level debugger, instead of single-stepping the asm code.