I am trying to learn what _mm256_permute2f128_ps() does, but I can't fully understand Intel's code example:
DEFINE SELECT4(src1, src2, control) {
    CASE(control[1:0]) OF
    0: tmp[127:0] := src1[127:0]
    1: tmp[127:0] := src1[255:128]
    2: tmp[127:0] := src2[127:0]
    3: tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0
Specifically, I don't understand:
The imm8[3:0] notation. Are they using it as a 4-byte mask? But I've seen people invoke _mm256_permute2f128_pd(myVec, myVec, 5), where imm8 is used as a number (the number 5).
Inside the SELECT4 function, what does control[1:0] mean? Is control a byte mask, or is it used as a number? How many bytes is it made of?
IF control[3] is used in Intel's example. Doesn't it undo the choice 3: inside CASE? Why would we ever want to set tmp[127:0]
to zero, if we've been outputting into it?

The [x:y] notation always refers to bit numbers here, so imm8 is an ordinary 8-bit integer and imm8[3:0] is its low four bits. E.g., if you pass 5 as the imm8 argument, then (because 5==0b00000101) imm8[3:0]==0b0101==5 and imm8[7:4]==0. If that low nibble is passed as control to the SELECT4 macro, you get control[3]==0==false and control[1:0]==0b01==1. The control[2] bit is ignored.
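
To make the bit-range notation concrete, here is a minimal C sketch. The helper name bits() is hypothetical (it is not part of any Intel header); it only mimics what value[hi:lo] means in the pseudocode for an 8-bit immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns bits [hi:lo] of an 8-bit value,
   mirroring the imm8[hi:lo] notation used in Intel's pseudocode. */
static unsigned bits(uint8_t value, unsigned hi, unsigned lo)
{
    assert(hi < 8 && lo <= hi);              /* valid range inside one byte */
    unsigned width = hi - lo + 1;            /* number of bits selected */
    return ((unsigned)value >> lo) & ((1u << width) - 1u);
}
```

With this helper, bits(5, 3, 0) gives 5, bits(5, 7, 4) gives 0, bits(5, 1, 0) gives 1 and bits(5, 3, 3) gives 0, matching the values worked out above.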
Fully evaluating this for imm8 == 5, you get:
dst[127:0] := SELECT4(a[255:0], b[255:0], 5) == a[255:128]
dst[255:128] := SELECT4(a[255:0], b[255:0], 0) == a[127:0]
That means this swaps the upper and lower halves of the a register and stores the result in the dst register.
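
The whole pseudocode can also be modelled in portable C on plain float arrays, which may help when experimenting with different imm8 values. This is only a sketch under stated assumptions: select4_model and permute2f128_model are hypothetical names, array elements 0..3 stand for bits 127:0 and elements 4..7 for bits 255:128, and no real intrinsic is called.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of SELECT4: copies one 128-bit lane (4 floats) into out.
   The low two bits of control pick the lane; bit 3 zeroes the result.
   Bit 2 is ignored, as in the pseudocode. */
static void select4_model(const float src1[8], const float src2[8],
                          unsigned control, float out[4])
{
    assert(control <= 0xFu);                 /* control is a 4-bit field */
    const float *lanes[4] = { src1, src1 + 4, src2, src2 + 4 };
    memcpy(out, lanes[control & 0x3u], 4 * sizeof(float));
    if (control & 0x8u)                      /* IF control[3]: overrides the CASE */
        memset(out, 0, 4 * sizeof(float));
}

/* Model of _mm256_permute2f128_ps(a, b, imm8) on plain arrays. */
static void permute2f128_model(const float a[8], const float b[8],
                               uint8_t imm8, float dst[8])
{
    float lo[4], hi[4];                      /* temporaries, so dst may alias a or b */
    select4_model(a, b, imm8 & 0x0Fu, lo);          /* imm8[3:0] */
    select4_model(a, b, (imm8 >> 4) & 0x0Fu, hi);   /* imm8[7:4] */
    memcpy(dst, lo, sizeof lo);              /* dst[127:0]   */
    memcpy(dst + 4, hi, sizeof hi);          /* dst[255:128] */
}
```

Calling permute2f128_model(a, a, 5, dst) with a = {1, 2, 3, 4, 5, 6, 7, 8} yields {5, 6, 7, 8, 1, 2, 3, 4}, i.e. the swapped halves. Because the IF control[3] test comes after the CASE, it overrides whichever lane was selected: an imm8 of 0x08 yields {0, 0, 0, 0, 1, 2, 3, 4}, so that bit is a way to zero one half of the result on purpose. The real intrinsic additionally requires AVX support and a compile-time constant imm8.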
The dst[MAX:256] := 0 line is only relevant on architectures with larger registers (i.e., if you have AVX-512): it sets everything above bit 255 to zero. This is in contrast to legacy SSE instructions, which (if executed on CPUs with AVX support) leave the upper bits of the wider register unchanged (and can produce false dependencies; see this related question).