I have a function written in F# for .NET that uses SSE2. I've written the same thing using AVX2, but the underlying question is the same: what is the intended purpose of a MoveMask? I know that it works for my purposes; I want to know why.
I am iterating through two 64-bit float arrays, a and b, testing that all of their values match. I am using the CompareEqual method (which I believe is wrapping a call to __m128d _mm_cmpeq_pd) to compare several values at a time. I then compare that result with a Vector128 of 0.0 64-bit floats. My reasoning is that the result of CompareEqual will give a 0.0 value in the cases where the values don't match. Up to this point, it makes sense.
I then use the Sse2.MoveMask method on the result of the comparison with the zero vector. I've previously worked on using SSE and AVX for matching, and I saw examples of people using MoveMask for the purpose of testing for non-zero values. I believe this method is using the int _mm_movemask_epi8 Intel intrinsic. I have included the F# code and the assembly that is JITed.
Is this really the intention of a MoveMask, or is it just a happy coincidence that it works for these purposes? I know my code works; I want to know WHY it works.
#nowarn "9" "51" "20" // Don't want warnings about pointers

open System
open FSharp.NativeInterop
open System.Runtime.Intrinsics.X86
open System.Runtime.Intrinsics
open System.Collections.Generic

let sseFloatEquals (a: array<float>) (b: array<float>) =
    if a.Length = b.Length then
        let mutable result = true
        let mutable idx = 0

        if a.Length > 3 then
            let lastBlockIdx = a.Length - (a.Length % Vector128<float>.Count)
            let aSpan = a.AsSpan ()
            let bSpan = b.AsSpan ()
            let aPointer = && (aSpan.GetPinnableReference ())
            let bPointer = && (bSpan.GetPinnableReference ())
            let zeroVector = Vector128.Create 0.0

            while idx < lastBlockIdx && result do
                let aVector = Sse2.LoadVector128 (NativePtr.add aPointer idx)
                let bVector = Sse2.LoadVector128 (NativePtr.add bPointer idx)
                let comparison = Sse2.CompareEqual (aVector, bVector)
                let zeroTest = Sse2.CompareEqual (comparison, zeroVector)
                // The line I want to understand
                let matches = Sse2.MoveMask (zeroTest.AsByte ())
                if matches <> 0 then
                    result <- false
                idx <- idx + Vector128.Count

        while idx < a.Length && idx < b.Length && result do
            if a.[idx] <> b.[idx] then
                result <- false
            idx <- idx + 1

        result
    else
        false
; Core CLR 5.0.921.35908 on amd64
_.sseFloatEquals$cont@11(System.Double[], System.Double[], Microsoft.FSharp.Core.Unit)
L0000: push rdi
L0001: push rsi
L0002: push rbp
L0003: push rbx
L0004: sub rsp, 0x28
L0008: vzeroupper
L000b: mov eax, 1
L0010: xor r8d, r8d
L0013: mov r9d, [rcx+8]
L0017: cmp r9d, 3
L001b: jle short L008e
L001d: mov r10d, r9d
L0020: and r10d, 1
L0024: mov r11d, r9d
L0027: sub r11d, r10d
L002a: lea r10, [rcx+0x10]
L002e: mov esi, r9d
L0031: test rdx, rdx
L0034: jne short L003c
L0036: xor edi, edi
L0038: xor ebx, ebx
L003a: jmp short L0043
L003c: lea rdi, [rdx+0x10]
L0040: mov ebx, [rdx+8]
L0043: xor ebp, ebp
L0045: test esi, esi
L0047: je short L004c
L0049: mov rbp, r10
L004c: xor r10d, r10d
L004f: test ebx, ebx
L0051: je short L0056
L0053: mov r10, rdi
L0056: vxorps xmm0, xmm0, xmm0
L005a: cmp r8d, r11d
L005d: jge short L008e
L005f: mov esi, eax
L0061: test esi, esi
L0063: je short L008e
L0065: movsxd rsi, r8d
L0068: vmovupd xmm1, [rbp+rsi*8]
L006e: vmovupd xmm2, [r10+rsi*8]
L0074: vcmpeqpd xmm1, xmm1, xmm2
L0079: vcmpeqpd xmm1, xmm1, xmm0
L007e: vpmovmskb esi, xmm1
L0082: test esi, esi
L0084: je short L0088
L0086: xor eax, eax
L0088: add r8d, 4
L008c: jmp short L005a
L008e: cmp r9d, r8d
L0091: jle short L00c8
L0093: cmp [rdx+8], r8d
L0097: jle short L00c8
L0099: mov r10d, eax
L009c: test r10d, r10d
L009f: je short L00c8
L00a1: cmp r8d, r9d
L00a4: jae short L00d1
L00a6: movsxd r10, r8d
L00a9: vmovsd xmm0, [rcx+r10*8+0x10]
L00b0: cmp r8d, [rdx+8]
L00b4: jae short L00d1
L00b6: vucomisd xmm0, [rdx+r10*8+0x10]
L00bd: jp short L00c1
L00bf: je short L00c3
L00c1: xor eax, eax
L00c3: inc r8d
L00c6: jmp short L008e
L00c8: add rsp, 0x28
L00cc: pop rbx
L00cd: pop rbp
L00ce: pop rsi
L00cf: pop rdi
L00d0: ret
L00d1: call 0x00007ffcef38a370
L00d6: int3
_.sseFloatEquals(System.Double[], System.Double[])
L0000: mov r8d, [rcx+8]
L0004: cmp r8d, [rdx+8]
L0008: jne short L0012
L000a: xor r8d, r8d
L000d: jmp 0x00007ffc99000480
L0012: xor eax, eax
L0014: ret
MoveMask just extracts the high bit of each element into an integer bitmap. You have 3 element-size options: movmskpd (64-bit), movmskps (32-bit), and pmovmskb (8-bit).
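To make the three element sizes concrete, here is a minimal C sketch using the SSE/SSE2 intrinsics that correspond to those instructions. The helper names (`sign_mask_pd`, `sign_mask_ps`, `high_bit_mask_epi8`) are made up for illustration; only the `_mm_*` calls are real intrinsics.

```c
#include <emmintrin.h>  /* SSE2: _mm_movemask_pd, _mm_movemask_epi8 */
#include <xmmintrin.h>  /* SSE:  _mm_movemask_ps */

/* movmskpd: sign bit of each of the two doubles -> 2-bit mask (bit 0 = lo). */
static int sign_mask_pd(double lo, double hi) {
    /* _mm_set_pd takes the high element first. */
    return _mm_movemask_pd(_mm_set_pd(hi, lo));
}

/* movmskps: sign bit of each of the four floats -> 4-bit mask (bit 0 = e0). */
static int sign_mask_ps(float e0, float e1, float e2, float e3) {
    return _mm_movemask_ps(_mm_set_ps(e3, e2, e1, e0));
}

/* pmovmskb: top bit of each of the 16 bytes -> 16-bit mask (bit 0 = bytes[0]). */
static int high_bit_mask_epi8(const unsigned char *bytes) {
    return _mm_movemask_epi8(_mm_loadu_si128((const __m128i *)bytes));
}
```

For example, `sign_mask_pd(-1.0, 2.0)` gives `0x1` because only the low element has its sign bit set. No compare is involved here; movemask only looks at the top bit of each element.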
This works well with SIMD compares, which produce all-zero bits in elements where the predicate is false and all-one bits in elements where the predicate is true. All-ones is a bit-pattern for -QNaN if interpreted as an IEEE-FP floating-point value, but normally you don't do that. Instead you use movemask, or AND (or AND / ANDN / OR, or _mm_blendv_pd), or things like that with a compare result.
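As an illustration of using a compare result as an AND mask rather than feeding it to movemask, here is a small C sketch. The function name `sum_of_positives` and the task itself are invented for the example; the point is only that the all-ones / all-zero elements act as a per-element keep-or-zero mask.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sum only the elements of a[0..n) that are > 0.0, without branching per element. */
static double sum_of_positives(const double *a, size_t n) {
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v    = _mm_loadu_pd(a + i);
        /* All-ones in elements where v > 0.0, all-zero elsewhere (also for NaN). */
        __m128d mask = _mm_cmpgt_pd(v, _mm_setzero_pd());
        /* AND keeps the element where the mask is all-ones and zeroes it otherwise. */
        acc = _mm_add_pd(acc, _mm_and_pd(v, mask));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];
    for (; i < n; ++i)          /* scalar tail for an odd element count */
        if (a[i] > 0.0) sum += a[i];
    return sum;
}
```

The compare result is never interpreted as a floating-point number here; it is only ever used as a bit pattern.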
movemask(v) != 0, movemask(v) == 0x3, or movemask(v) == 0 is how you check conditions like at least one element in a compare matched, or all matched, or none matched, respectively, where v is the result of _mm_cmpeq_pd or whatever. (Or just to extract signs directly without a compare.)
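Spelled out in C for two doubles per vector, those three checks might look like the sketch below. The helper names are hypothetical; `_mm_cmpeq_pd` and `_mm_movemask_pd` are the real SSE2 intrinsics.

```c
#include <emmintrin.h>

/* 2-bit mask: bit i is set when a[i] == b[i]. */
static int eq_mask_pd(const double *a, const double *b) {
    __m128d eq = _mm_cmpeq_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(eq);
}

/* At least one element matched. */
static int any_equal(const double *a, const double *b)  { return eq_mask_pd(a, b) != 0; }

/* Both elements matched: both mask bits set. */
static int all_equal(const double *a, const double *b)  { return eq_mask_pd(a, b) == 0x3; }

/* No element matched. */
static int none_equal(const double *a, const double *b) { return eq_mask_pd(a, b) == 0; }
```

Each function reads exactly two doubles from each pointer, so callers need at least two valid elements behind both arguments.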
For other element sizes, compare against 0xf or 0xffff to match all four or all 16 bits. For AVX 256-bit vectors there are twice as many bits, up to filling a whole 32-bit integer with vpmovmskb eax, ymm0.
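The same "all matched" check for the 32-bit and 8-bit element sizes, as a hedged C sketch restricted to 128-bit vectors so it only needs SSE2 (the function names are invented for the example):

```c
#include <emmintrin.h>
#include <xmmintrin.h>

/* Four floats: all equal when all 4 bits of the movmskps mask are set. */
static int all_equal_ps(const float *a, const float *b) {
    __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(eq) == 0xf;
}

/* Sixteen bytes: all equal when all 16 bits of the pmovmskb mask are set. */
static int all_equal_epi8(const void *a, const void *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xffff;
}
```

`all_equal_ps` reads four floats and `all_equal_epi8` reads 16 bytes from each pointer, regardless of what the data represents.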
What you're doing is really weird: using a 0.0 / NaN compare result as the input to another compare, with vcmpeqpd xmm1, xmm1, xmm2 / vcmpeqpd xmm1, xmm1, xmm0. The 2nd comparison can only be true for elements that are == 0.0 (i.e. +-0.0), because x == NaN is false for every x.
If the second vector is a constant zero (let zeroTest = Sse2.CompareEqual (comparison, zeroVector)), that's pointless: you're just inverting the compare result, which you could have done by checking a different integer condition or comparing against a different constant instead of doing another runtime comparison. (0.0 == 0.0 is true, producing an all-ones output; 0.0 == -NaN is false, producing an all-zero output.)
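A C sketch of that point, under the assumption that the goal is plain element-wise equality of two double arrays. `mismatch_bytes_two_step` mirrors the two-compare sequence from the question, `mismatch_bytes_direct` gets the same bitmap from one compare by inverting the integer mask, and `doubles_equal` is what the loop could look like when it simply checks for "all matched". All three names are hypothetical.

```c
#include <emmintrin.h>
#include <stddef.h>

/* The question's approach: compare, then compare the result against 0.0.
   The byte mask ends up set for the elements that did NOT match. */
static int mismatch_bytes_two_step(__m128d va, __m128d vb) {
    __m128d eq       = _mm_cmpeq_pd(va, vb);
    __m128d inverted = _mm_cmpeq_pd(eq, _mm_setzero_pd());
    return _mm_movemask_epi8(_mm_castpd_si128(inverted));
}

/* Same bitmap from a single compare: invert the 16-bit integer mask instead. */
static int mismatch_bytes_direct(__m128d va, __m128d vb) {
    return _mm_movemask_epi8(_mm_castpd_si128(_mm_cmpeq_pd(va, vb))) ^ 0xffff;
}

/* Returns 1 when a[0..n) and b[0..n) are element-wise equal, else 0.
   Ordinary FP equality: NaN never equals anything, and +0.0 == -0.0. */
static int doubles_equal(const double *a, const double *b, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d eq = _mm_cmpeq_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i));
        if (_mm_movemask_pd(eq) != 0x3)   /* at least one compare was false */
            return 0;
    }
    for (; i < n; ++i)                    /* scalar tail */
        if (a[i] != b[i]) return 0;
    return 1;
}
```

As far as I know, .NET's Sse2.MoveMask also has an overload taking a Vector128 of doubles (the movmskpd form), so the same "mask != 0x3" test should be expressible there without the AsByte conversion; treat that as something to verify against the .NET docs.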
To learn more about intrinsics and SIMD, see for example Agner Fog's optimization guide; his asm guide has a chapter on SIMD. Also, his VectorClass library for C++ has some useful wrappers, and for learning purposes seeing how those wrapper functions implement some basic things could be useful.
To learn what things actually do, see Intel's intrinsics guide. You can search by asm instruction or C++ intrinsic name.
I think MS has docs for their C# System.Runtime.Intrinsics.X86, and I assume F# uses the same intrinsics, but I don't use either language myself.
Related re: comparisons:
Get the last line separator - pcmpeqb -> pmovmskb -> bsr to find the position of the last match element in a vector of compare results. Bit-scan reverse on the compare mask. Often you want to scan forward to find the first match (or invert and find first mismatch, like for memcmp). e.g. Compare 16 byte strings with SSE
Or popcount them if you're counting occurrences by matching against a loop-invariant vector of a broadcasted character: How can I count the occurrence of a byte in array using SIMD? - instead of movemask, use the compare result as integer 0 / -1. SIMD subtract from a vector accumulator in the inner loop, then horizontal sum of integer elements in an outer loop.
SIMD instructions for floating point equality comparison (with NaN == NaN) - useful exercise in understanding how NaNs work.
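The compare -> movemask -> bit-scan pattern from the first related item can be sketched in C as a memcmp-style "find the first mismatch" helper. This is only an illustration: `first_mismatch` is a made-up name, and `__builtin_ctz` is a GCC/Clang builtin, so other compilers would need their own bit-scan intrinsic.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Index of the first byte where a and b differ, or n if the first n bytes are equal. */
static size_t first_mismatch(const unsigned char *a, const unsigned char *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Bit k is set when byte k matched. */
        unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        /* Invert within 16 bits: bit k is set when byte k differs. */
        unsigned ne = ~eq & 0xffffu;
        if (ne != 0)
            return i + (size_t)__builtin_ctz(ne);  /* position of the lowest set bit */
    }
    for (; i < n; ++i)                             /* scalar tail */
        if (a[i] != b[i]) return i;
    return n;
}
```

Scanning from the other end of the mask (bit-scan reverse) would give the last mismatch in the vector instead of the first.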