Proper use of _mm256_maskload_ps for loading less than 8 floats into __m256

I am having trouble wrapping my mind around which bits need to be set for masking using _mm256_maskload_ps.

The documentation states that the mask is the "integer value calculated based on the most-significant-bit of each doubleword of a mask register"

Parsing this out, I think that there are 4 64 bit integers. I want to mask 8 values so I can think of this as 8 32 bit integers (this is where my understanding gets shaky) each of which has a MSB reserved for sign, 1 being negative and 0 being positive. So I could set -1 for "please load this" and 0 for "dont load this" for 8 32 bit integers and my mask should be correct. However, we actually have 4 64 bit integers so maybe I have to pack them?

Essentially I'm looking for a way to describe a mask such that 1,2,3...8 of the first elements are set when i do _mm256_maskload_ps

Note: What's interesting is that when my mask is {-1, 0, 0, 0} the first 2 elements get set. when my mask is {0xFFFFFFFF, 0, 0, 0} only the first element gets set.

#include <iostream>
#include <immintrin.h>
#include <string>

using namespace std;

int main()
{
  float a[3] {1,2,3};
  float b[3] {11, 22, 33};

  auto disp = [](float *arr) {
    cout << "[";
    string sep;
    for (size_t i = 0; i < 3; i++)
    {
      cout << sep << arr[i];
      sep = ", ";
    }
    cout << "]";
    cout << endl;
  };
  disp(a);
  disp(b);

  __m256 _a, _b;
  __m256i _load_mask = {-1, 0, 0, 0};


  _a = _mm256_maskload_ps(a, _load_mask);
  _b = _mm256_maskload_ps(b, _load_mask);
  _a = _mm256_add_ps(_a, _b);


  float c[8];
  _mm256_storeu_ps(c, _a);
  disp(c);

  return 0;
}

Displays

[1, 2, 3]
[11, 22, 33]
[12, 24, 0]

when compiled with

!clang++ -mavx -Wall -Wextra -std=c++17 -stdlib=libc++ -ggdb % -o $(basename -s .cpp %

on my mac, where % is the filename

Solution

A doubleword is 32-bits, not 64. Word = 16, doubleword = 32, quadword = 64. The first two elements get selected because -1 is all ones across all 64 bits, so when the maskload treats it as two 32-bit values instead of one 64-bit value the highest bit of both elements will be set. 0xFFFFFFFF, OTOH, is the least sigificant 32 bits set and the most significant 32 bits unset. Since x86 is little-endian the least significant bits come first, which is why you end up with he first element selected but not the second.

The documentation in the intrinsics guide is much better here.

Note that on GCC/clang, __m256i is implemented using vector extensions. MSVC, however, does not support vector extensions so your code won't work there. Also, both GCC and clang use a vector of 64-bit values even though the same __m256i type is used for all integer vectors, so you'll probably want to use _mm256_set_epi32, _mm256_setr_epi32 or _mm256_load_si256 to create your _load_mask anyways.

Oh, names starting with an underscore are reserved in both C and C++. Don't do that. You can use a trailing underscore if you really need to convey that it's an internal variable or something, but I don't really see a reason to do that in tho code you've posted above.