Tags: assembly, x86, cpu-registers, avx512

Problem with AVX-512 code optimization (NASM)


I had nothing to do, so I decided to study AVX-512 and try to write something in assembler using it. I'm trying to optimize a piece of code that processes large amounts of data using AVX-512 instructions. The goal is to maximize the capabilities of vector registers and minimize the number of processor cycles.

The problem is this: I want to use masking to process only some of the elements in the zmm0 and zmm1 registers, depending on a certain condition. However, masked AVX-512 instructions (such as vaddps) require a mask in one of the registers k1-k7 (using k0 as a writemask means "no masking"):

vmovups zmm0, [rsi]      ; load 16 floats into zmm0
vmovups zmm1, [rsi+64]   ; load the next 16 floats into zmm1

; some code here

vaddps zmm0, zmm0, zmm1
vmovups [rdi], zmm0      ; store the result

add rsi, 128 ; advance source pointer (two 64-byte loads)
add rdi, 64  ; advance destination pointer (one 64-byte store)

At the same time, the condition by which I want to mask the data comes from comparing two other zmm registers.

So here's the question:

Is there an efficient way to generate a mask in a k register from a comparison of values in zmm registers, and then use it for selective data processing with AVX-512 instructions? Or is there another way to achieve the desired result with AVX-512 without resorting to masks?

I remember that there is vpcmpd, which compares the values of vector registers, and supposedly you can do something like k1 = zmm0 > zmm1 combined with k2 = zmm0 < zmm2, but honestly I have no idea how efficient that is; I tried it, but for lack of knowledge I abandoned the idea.


Solution

  • To summarize the discussion from the comments: you were right to assume that the vcmp family of instructions, such as vcmpps, is the proper way to do this. Masked AVX-512 instructions are generally fast. When possible, prefer the zeroing form, such as vaddps zmm1{k1}{z}, zmm2, zmm3, over the merging form, such as vaddps zmm1{k1}, zmm2, zmm3, to avoid a dependency on the previous contents of the destination register.
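
    Applied to your loop, a minimal sketch might look like this (assuming, hypothetically, that the threshold values live in zmm2):

    vcmpps  k1, zmm0, zmm2, 1          ; k1 = zmm0 < zmm2    (_CMP_LT_OS)
    vaddps  zmm0{k1}{z}, zmm0, zmm1    ; masked-off lanes are zeroed
    vmovups [rdi], zmm0
    

    With the merging form ({k1} without {z}), the masked-off lanes would instead keep the previous contents of zmm0.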

    One thing to look out for with mask registers is that some of the instructions for computing them are rather slow. For example, kadd has a latency of 4 cycles on Intel according to uops.info, while kand has a latency of only 1 but still a throughput of only 1 per cycle.
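
    Combining two masks explicitly would look like this; the kandw works, but it adds a cycle of latency on the critical path:

    vcmpps  k1, zmm1, zmm2, 1    ; k1 = zmm1 < zmm2
    vcmpps  k2, zmm2, zmm3, 1    ; k2 = zmm2 < zmm3
    kandw   k1, k1, k2           ; k1 = k1 & k2
    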

    However, you can often avoid combining masks that way: vcmp itself accepts an input mask, and the output mask will be zero wherever the input mask was zero, which is effectively an AND. For example, the condition zmm1 < zmm2 && zmm2 < zmm3 can be written as

    vcmpps  k1, zmm1, zmm2, 1        ; k1 = zmm1 < zmm2    (_CMP_LT_OS)
    vcmpps  k1{k1}, zmm2, zmm3, 1    ; k1 &= zmm2 < zmm3
    

    We cannot form an OR that way, but we can still avoid using two mask registers. For example, zmm1 < zmm2 || zmm2 < zmm3 is the same as !(!(zmm1 < zmm2) && !(zmm2 < zmm3)) by De Morgan's laws:

    vcmpps  k1, zmm1, zmm2, 5        ; k1 = !(zmm1 < zmm2)    (_CMP_NLT_US)
    vcmpps  k1{k1}, zmm2, zmm3, 5    ; k1 &= !(zmm2 < zmm3)
    knotw   k1, k1                   ; k1 = zmm1 < zmm2 || zmm2 < zmm3
    

    On the other hand, computing the two masks independently and merging them via korw removes the input dependency of one vcmp on the other, potentially increasing instruction-level parallelism.
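
    A sketch of that variant:

    vcmpps  k1, zmm1, zmm2, 1    ; the two compares can execute in parallel
    vcmpps  k2, zmm2, zmm3, 1
    korw    k1, k1, k2           ; k1 = zmm1 < zmm2 || zmm2 < zmm3
    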