How to populate a 64-bit register with duplicate byte values?

I'm doing some x64 assembly with Visual C++ 2010 and MASM (fastcall calling convention).

So let's say I have a function in C++:

extern "C" void fillArray(unsigned char* byteArray, unsigned char value);

The pointer to the array will be in RCX and the char value will be in DL.

How can I fill RAX with copies of the byte in DL, so that if I were to mov qword ptr [RCX], RAX and print byteArray, every byte written would equal the char value?

Please note that I'm not trying to out-optimize my compiler, I'm just learning.


  • Because you called your procedure 'fillArray', I assumed you want to fill a whole memory block with a byte value, so I compared several approaches. It is 32-bit MASM code, but the results should be similar in 64-bit mode. Each approach is tested with both aligned and unaligned buffers. Here are the timings (lower is better):

    Simple REP STOSB - aligned....: 192
    Simple REP STOSB - not aligned: 192
    Simple REP STOSD - aligned....: 191
    Simple REP STOSD - not aligned: 222
    Simple while loop - aligned....: 267
    Simple while loop - not aligned: 261
    Simple while loop with different addressing - aligned....: 271
    Simple while loop with different addressing - not aligned: 262
    Loop with 16-byte SSE write - aligned....: 192
    Loop with 16-byte SSE write - not aligned: 205
    Loop with 16-byte SSE write non-temporal hint - aligned....: 126 (EDIT)

    The most naive variant using the following code seems to perform best in both scenarios and has the smallest code size as well:

    mov al, 44h   ; byte value
    mov edi, lpDst
    mov ecx, 256000*4  ; buf size
    rep stosb

    EDIT: It's not the fastest for aligned data. Added MOVNTDQ version which performs best, see below.

    For the sake of completeness, here are excerpts from the other routines. The byte value is assumed to have been replicated into all four bytes of EAX beforehand:

    Rep Stosd:

    mov edi, lpDst
    mov ecx, 256000
    rep stosd

    Simple While:

    mov edi, lpDst
    mov ecx, 256000
    .while ecx>0
        mov [edi],eax
        add edi,4
        dec ecx
    .endw

    Different simple while:

    mov edi, lpDst
    xor ecx, ecx
    .while ecx<256000
        mov [edi+ecx*4],eax
        inc ecx
    .endw


    Loop with 16-byte SSE write:

    movd xmm0,eax
    punpckldq xmm0,xmm0    ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
    punpcklqdq xmm0,xmm0   ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
    mov ecx, 256000/4   ; number of 16-byte stores
    mov edi, lpDst
    .while ecx>0
        movdqa xmmword ptr [edi],xmm0    ; movdqu for unaligned
        add edi,16
        dec ecx
    .endw


    Loop with 16-byte SSE write, non-temporal hint:

    movd xmm0,eax
    punpckldq xmm0,xmm0    ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
    punpcklqdq xmm0,xmm0   ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
    mov ecx, 256000/4   ; number of 16-byte stores
    mov edi, lpDst
    .while ecx>0
        movntdq xmmword ptr [edi],xmm0    ; requires a 16-byte aligned address
        add edi,16
        dec ecx
    .endw

    I uploaded the whole code here; the MASM package from hutch is required to assemble it.

    If SSSE3 is available, you can use pshufb to broadcast a byte to all positions of a register instead of a chain of punpck instructions.

    movd    xmm0, edx
    xorps   xmm1,xmm1      ; xmm1 = 0
    pshufb  xmm0, xmm1     ; xmm0 = _mm_set1_epi8(dl)
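
For the scalar case the question actually asks about (getting eight copies of DL into RAX), one common trick is to zero-extend the byte and multiply it by 0x0101010101010101. In assembly that would be roughly movzx eax, dl / mov r8, 0101010101010101h / imul rax, r8. Below is a minimal C++ sketch of the same idea, not the code from the benchmark above. The names broadcastByte and fillFirstEight are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: replicate one byte into all eight bytes of a
// 64-bit value. Each set byte of the constant contributes one shifted
// copy of `value`; the copies never overlap, so no carries occur.
std::uint64_t broadcastByte(std::uint8_t value) {
    return static_cast<std::uint64_t>(value) * 0x0101010101010101ULL;
}

// Hypothetical helper: the C++ counterpart of
//   mov qword ptr [rcx], rax
// It writes the replicated pattern to the first eight bytes of the array.
// The caller must guarantee that byteArray has room for at least 8 bytes.
void fillFirstEight(unsigned char* byteArray, unsigned char value) {
    const std::uint64_t pattern = broadcastByte(value);
    std::memcpy(byteArray, &pattern, sizeof pattern);
}
```

The same multiplication with 01010101h gives the 32-bit pattern that the routines above expect in EAX.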