Tags: assembly, x86, 32-bit, micro-optimization, avx512

AVX512BW: handle 64-bit mask in 32-bit code with bsf / tzcnt?


This is my code for a 'strlen' function using AVX512BW:

vxorps          zmm0, zmm0, zmm0   ; ZMM0 = 0
vpcmpeqb        k0, zmm0, [ebx]    ; ebx is string and it's aligned at 64-byte boundary
kortestq        k0, k0             ; 0x00 found ?
jnz             .chk_0x00

Now for 'chk_0x00': on x86-64 systems there is no problem, and we can handle it like this:

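.chk_0x00:                 ; local label, matching the 'jnz .chk_0x00' above
kmovq   rbx, k0            ; copy the full 64-bit mask to a general-purpose register
tzcnt   rbx, rbx           ; index of the first 0x00 byte within the 64-byte block
add     rax, rbx           ; add it to the running length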

Here we have a 64-bit register, so we can store the whole mask in it. My question is about 32-bit x86, where there are no 64-bit general-purpose registers, so we must reserve 8 bytes of memory and check both DWORDs of the mask one by one. (This is my current approach, and I want to know whether there is a better way.)

.chk_0x00:
kmovd   ebx, k0        ; move the first DWORD of the mask to ebx
test    ebx, ebx       ; 0x00 found in the first DWORD?
jz      .check_next_dword
bsf     ebx, ebx       ; index of the first 0x00 byte
add     eax, ebx
jmp     .done
.check_next_dword:
add     eax, 32        ; no 0x00 in the first DWORD of the mask, so skip those 32 bytes
sub     esp, 8         ; reserve 8 bytes on the stack
kmovq   [esp], k0      ; store the 8-byte mask from k0 into the reserved memory
mov     ebx, [esp+4]   ; load the second DWORD of the mask into ebx
bsf     ebx, ebx       ; cannot be zero here: kortestq found a set bit and the low DWORD was 0
add     eax, ebx
add     esp, 8         ; release the reserved stack space
.done:

In my 32-bit version I used 'kmovd' to move the first DWORD of the mask into ebx, but I don't know how to get at the second DWORD. So I reserved 8 bytes on the stack, stored the whole 8-byte mask there, loaded the second DWORD into ebx, and checked it again. Is there a better solution? (I think my way is not fast enough.) Also, is it correct to use vxorps to initialize a zmm register to zero?


Solution

  • It looks like KSHIFTRQ could be used as an alternative: it right-shifts the upper 32 bits of the k0 mask into the lower 32 bits, which can then be copied to a general-purpose register with kmovd. For example:

    .check_next_dword:
          add     eax, 32          ; skip the 32 bytes covered by the low DWORD
          kshiftrq k0, k0, 32      ; shift the high 32 bits down into the low 32 bits
          kmovd   ebx, k0          ; ebx = former high DWORD of the mask
        ...
    

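    And yes, vxorps zmm0, zmm0, zmm0 will set zmm0 to zero: according to the vxorps reference, it XORs its two source operands and, with no write mask specified, writes the whole result to the destination, which is the first operand (see also the SO question about zeroing a zmm register). Note that the 512-bit EVEX form of vxorps requires AVX512DQ; the VEX-encoded vxorps xmm0, xmm0, xmm0 is shorter and also zeroes the whole of zmm0.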
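
    Putting the pieces together, a minimal sketch of the whole routine for 32-bit code might look like the following. It is only an illustration under stated assumptions, not a tuned implementation: the function name strlen_avx512bw, the cdecl calling convention, and the choice of eax/ecx/edx as scratch registers are all hypothetical, and the string is assumed to be 64-byte aligned as in the question.

```nasm
; Hypothetical sketch: size_t strlen_avx512bw(const char *s), 32-bit cdecl.
; Assumptions: s is 64-byte aligned and the CPU supports AVX512BW.
bits 32
section .text
global strlen_avx512bw

strlen_avx512bw:
        mov     edx, [esp+4]            ; edx = s (first stack argument)
        xor     eax, eax                ; eax = running length
        vxorps  xmm0, xmm0, xmm0        ; VEX-encoded write to xmm0 zeroes all of zmm0
.loop:
        vpcmpeqb k0, zmm0, [edx+eax]    ; bit i of k0 is set if byte i of the block is 0x00
        kortestq k0, k0                 ; ZF = 1 if the whole mask is zero
        jnz     .chk_0x00               ; a terminator is somewhere in this block
        add     eax, 64                 ; no terminator: advance to the next 64-byte block
        jmp     .loop
.chk_0x00:
        kmovd   ecx, k0                 ; ecx = low 32 bits of the mask
        test    ecx, ecx
        jnz     .found                  ; terminator is within the first 32 bytes
        kshiftrq k0, k0, 32             ; bring the high 32 bits down to bits 31:0
        kmovd   ecx, k0                 ; non-zero here: the mask was non-zero and its low half was 0
        add     eax, 32                 ; account for the 32 bytes already ruled out
.found:
        bsf     ecx, ecx                ; index of the lowest set bit (input is never 0 here)
        add     eax, ecx                ; eax = string length
        ret
```

    Compared with the version in the question, this keeps the mask in registers and needs no stack space; both halves share a single bsf. It should assemble with something like nasm -f elf32, but it has not been benchmarked against the store-and-reload approach, so whether it is actually faster would need to be measured.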