Making my homework for implementing Conway's Game of Life using intrinsic functions found the working code, but cannot understand the main part of it.
This implementation first calculates amount of alive neighbors for each sell and store the result in an array counts
, so the array of sells (world) is states
. I cannot really get how the newstate
is generated here. I understand how left shift works, how bitwise OR works, but I cannot understand why they're used like this, why shufmask
is like this and how the shuffle works. Also cannot understand why _mm256_slli_epi16 used if the type of array elements is uint8_t. So my question is all about this string
__m256i newstate = _mm256_shuffle_epi8(shufmask, _mm256_or_si256(c, _mm256_slli_epi16(oldstate, 3)));
Could you please explain for me, dummy boy, if it's possible maximum detailed, how it works.
void gameoflife8vec(uint8_t *counts, uint8_t *states, size_t width, size_t height) {
assert(width % (sizeof(__m256i)) == 0);
size_t awidth = width + 2;
computecounts8vec(counts, states, width, height);
__m256i shufmask =
_mm256_set_epi8(
0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0
);
for (size_t i = 0; i < height; i++) {
for (size_t j = 0; j < width; j += sizeof(__m256i)) {
__m256i c = _mm256_lddqu_si256(
(const __m256i *)(counts + (i + 1) * awidth + j + 1));
c = _mm256_subs_epu8(
c, _mm256_set1_epi8(
1)); // max was 8 = 0b1000, make it 7, 1 becomes 0, 0 remains 0
__m256i oldstate = _mm256_lddqu_si256(
(const __m256i *)(states + (i + 1) * awidth + j + 1));
__m256i newstate = _mm256_shuffle_epi8(
shufmask, _mm256_or_si256(c, _mm256_slli_epi16(oldstate, 3)));
_mm256_storeu_si256((__m256i *)(states + (i + 1) * awidth + (j + 1)),
newstate);
}
}
}
The memory for array is allocated in this way
uint8_t *states = (uint8_t *)malloc((N + 2) * (N + 2) * sizeof(uint8_t));
uint8_t *counts = (uint8_t *)malloc((N + 2) * (N + 2) * sizeof(uint8_t));
Also the source code can be found here https://github.com/lemire/SIMDgameoflife
shuffle_epi8
is being used here as a parallel table-lookup, with a constant first operand and a variable 2nd operand.
Daniel's code does some calculations that produce a 4-bit integer for every byte in the vector, then uses _mm256_shuffle_epi8
to map those integers to 0 / 1 alive-or-dead new states.
Notice that the low and high lanes of shufmask
are identical: it's the same lookup table for both lanes. (It's not a lane-crossing shuffle, it's 32 parallel lookups from 2x 16-byte tables, using the low 4 bits in each element. And the high bit to zero it out.) See the intrinsic and asm instruction documentation.
shufmask
is a poor choice of variable name. It's not the shuffle-control vector. alivetable
might be a better choice.
Using [v]pshufb
to implement a 16-entry LUT is a (fairly) well-known technique. For example, it's one way to implement a popcnt for large arrays that's faster than scalar, splitting bytes into low/high nibble and looking up the 4-bit popcnt results. See Counting 1 bits (population count) on large data using AVX-512 or AVX-2, specifically https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp