Understanding a code-example from the Intel Intrinsics Guide

I am trying to learn what _mm256_permute2f128_ps() does, but I can't fully understand Intel's code example.

DEFINE SELECT4(src1, src2, control) {
    CASE(control[1:0]) OF
    0:  tmp[127:0] := src1[127:0]
    1:  tmp[127:0] := src1[255:128]
    2:  tmp[127:0] := src2[127:0]
    3:  tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

Specifically, I don't understand:

The imm8[3:0] notation. Are they using it as a 4-byte mask? But I've seen people invoke _mm256_permute2f128_pd(myVec, myVec, 5), where imm8 is used as a number (the number 5).
Inside the SELECT4 function, what does control[1:0] mean? Is control a byte mask, or is it used as a number? How many bytes is it made of?
Why IF control[3] is used in Intel's example. Doesn't it undo the choice 3: inside CASE? Why would we ever want to set tmp[127:0] to zero, if we've been outputting into it?

Solution

The [x:y] notation always refers to bit numbers here, not bytes: imm8 is a single 8-bit value, and imm8[3:0] is its low four bits. E.g., if you pass 5 as the imm8 argument, then (because 5==0b00000101) imm8[3:0]==0b0101==5, and when that is passed as control to the SELECT4 macro, you get control[3]==0==false and control[1:0]==0b01==1. The control[2] bit is ignored.
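The bit slicing can be sketched in plain C with a shift and a mask. This only illustrates the notation; the helper name bits is hypothetical and not part of any Intel header.

```c
#include <stdint.h>

/* Hypothetical helper: returns value[hi:lo] in the pseudocode's notation,
   i.e. bits lo..hi (inclusive) shifted down to bit 0.
   Assumes 0 <= lo <= hi <= 7. */
static unsigned bits(uint8_t value, unsigned hi, unsigned lo)
{
    unsigned width = hi - lo + 1;
    return (value >> lo) & ((1u << width) - 1u);
}
```

For imm8 == 5, bits(5, 3, 0) is 5, bits(5, 7, 4) is 0, bits(5, 1, 0) is 1 and bits(5, 3, 3) is 0, matching the values above.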

Fully evaluating this for imm8 == 5 (so imm8[3:0] == 5 and imm8[7:4] == 0), you get

dst[127:0]   := SELECT4(a[255:0], b[255:0], 5) == a[255:128]
dst[255:128] := SELECT4(a[255:0], b[255:0], 0) == a[127:0]

That means this swaps the upper and lower halves of the a register and stores the result in the dst register.
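To check this kind of evaluation by hand, the pseudocode can be modelled with ordinary C arrays. The following is a minimal scalar sketch, not the real intrinsic: the type vec256 and the functions select4 and permute2f128_model are hypothetical names introduced only for illustration, and a 256-bit register is assumed to be laid out as 8 floats with lanes[0..3] holding bits 127:0.

```c
#include <stdint.h>
#include <string.h>

/* Illustration only: one 256-bit register modelled as 8 floats.
   lanes[0..3] is bits 127:0, lanes[4..7] is bits 255:128. */
typedef struct { float lanes[8]; } vec256;

/* Scalar model of SELECT4: writes the chosen 128-bit half to out[0..3]. */
static void select4(const vec256 *src1, const vec256 *src2,
                    unsigned control, float out[4])
{
    const float *half;
    switch (control & 3u) {              /* control[1:0] picks the source half */
    case 0:  half = &src1->lanes[0]; break;
    case 1:  half = &src1->lanes[4]; break;
    case 2:  half = &src2->lanes[0]; break;
    default: half = &src2->lanes[4]; break;
    }
    memcpy(out, half, 4 * sizeof(float));
    if (control & 8u) {                  /* control[3] overrides the choice with zero */
        memset(out, 0, 4 * sizeof(float));
    }
}

/* Scalar model of the whole operation on bits 255:0. */
static vec256 permute2f128_model(vec256 a, vec256 b, uint8_t imm8)
{
    vec256 dst;
    select4(&a, &b, imm8 & 0x0Fu, &dst.lanes[0]);         /* imm8[3:0] */
    select4(&a, &b, (imm8 >> 4) & 0x0Fu, &dst.lanes[4]);  /* imm8[7:4] */
    return dst;
}
```

With a holding the values 0 through 7, permute2f128_model(a, a, 5) gives 4, 5, 6, 7, 0, 1, 2, 3, i.e. the two halves swapped. The model also shows what control[3] is for: it is applied after the CASE, so it replaces whatever half was selected with zeros. For example, imm8 == 0x28 zeroes the low half of dst and puts the low half of b in the high half.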

The dst[MAX:256] := 0 line is only relevant for architectures with larger registers (i.e., if you have AVX-512): it sets everything above bit 255 to zero. This is in contrast to legacy SSE instructions, which (if executed on CPUs with AVX support) leave the upper bits of the wider register unchanged (and produce false dependencies -- see this related question).