Search code examples
arraysassemblyx86x86-16

Efficiency when swapping array contents


I'm new to assembly and I am trying to swap the contents between two arrays. I have this code so far and after testing it, I've verified it works. However, I am wondering if this is the most efficient way to get the desired result or if there is another possible solution to this that is more efficient?

arrW    WORD  100h, 200h, 300h
arrSW   SWORD  -140, 200, -300

mov ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov ax, [arrW +2]
xchg ax, [arrSW +2]
xchg ax, [arrW +2]
mov ax, [arrW + 4]
xchg ax, [arrSW +4]
xchg ax, [arrW +4]

Solution

  • mov ax, arrW
    xchg ax, arrSW
    xchg ax, arrW
    mov ax, [arrW +2]
    

    The first thing that struck me is that second xchg. There's no sense in loading the AX register right before the other load of AX in the following instruction. The first rewrite that also gives a 20% speed increase on 8086 therefore is:

    mov  ax, [arrW]
    xchg ax, [arrSW]
    mov  [arrW], ax
    mov  ax, [arrW + 2]
    xchg ax, [arrSW + 2]
    mov  [arrW + 2], ax
    mov  ax, [arrW + 4]
    xchg ax, [arrSW + 4]
    mov  [arrW + 4], ax
    

    A solution where you avoid using the xchg instruction doesn't pay on 8086, but is the way to go on x86 in general. eg. Next snippet is 10% slower on 8086:

    mov  ax, [arrW]
    mov  bx, [arrSW]
    mov  [arrW], bx
    mov  [arrSW], ax
    

    A loop can't beat your current unrolled code, but if the arrays should become larger then next is what it could look like:

     mov  cx, 3
     mov  si, OFFSET arrW
     mov  di, OFFSET arrSW
    More:
     mov  ax, [si]
     mov  dx, [di]
     mov  [si], dx
     mov  [di], ax
     add  si, 2
     add  di, 2
     dec  cx
     jnz  More
    

    In case the arrays arrW and arrSW follow each other in memory, the same loop is better written as:

     mov  bx, OFFSET arrW
    More:
     mov  ax, [bx]
     mov  dx, [bx + 6]
     mov  [bx], dx
     mov  [bx + 6], ax
     add  bx, 2
     cmp  bx, OFFSET arrSW
     jb   More
    

    If the CPU supports 32-bit registers, then working with these dwords can halve the number of iterations that are required. If the count of elements is odd, we peel off one word sized swap:

     mov  cx, 39
     mov  si, OFFSET arrW
     mov  di, OFFSET arrSW
     shr  cx, 1
     jnc  More             ; Count was even
     mov  ax, [si]
     mov  dx, [di]
     mov  [si], dx
     mov  [di], ax
     add  si, 2
     add  di, 2
    More:
     mov  eax, [si]
     mov  edx, [di]
     mov  [si], edx
     mov  [di], eax
     add  si, 4
     add  di, 4
     dec  cx
     jnz  More
    

    The above code peels off one word sized swap at the start of the loop. As @PeterCordes wrote in his comment below this answer, it is generally beter to put the peeled-off swap at the end (for reasons of data alignment). Next is that version:

     mov  cx, 39
     mov  si, OFFSET arrW
     mov  di, OFFSET arrSW
     shr  cx, 1            ; -> CF is set if count is odd
     jz   Next             \
    More:                   |
     mov  eax, [si]         |
     mov  edx, [di]         |
     mov  [si], edx         |
     mov  [di], eax         | Nothing changes the CF
     lea  si, [si + 4]      |
     lea  di, [di + 4]      |
     dec  cx                |
     jnz  More              |
    Next:                  /
     jnc  Done             ; (*) Count was even
     mov  ax, [si]
     mov  dx, [di]
     mov  [si], dx
     mov  [di], ax
    Done: