Efficiency when swapping array contents

I'm new to assembly and I am trying to swap the contents between two arrays. I have this code so far and after testing it, I've verified it works. However, I am wondering if this is the most efficient way to get the desired result or if there is another possible solution to this that is more efficient?

arrW    WORD  100h, 200h, 300h
arrSW   SWORD  -140, 200, -300

mov ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov ax, [arrW +2]
xchg ax, [arrSW +2]
xchg ax, [arrW +2]
mov ax, [arrW + 4]
xchg ax, [arrSW +4]
xchg ax, [arrW +4]

Solution

mov ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov ax, [arrW +2]

The first thing that struck me is that second xchg. There's no sense in loading the AX register right before the other load of AX in the following instruction. The first rewrite that also gives a 20% speed increase on 8086 therefore is:

mov  ax, [arrW]
xchg ax, [arrSW]
mov  [arrW], ax
mov  ax, [arrW + 2]
xchg ax, [arrSW + 2]
mov  [arrW + 2], ax
mov  ax, [arrW + 4]
xchg ax, [arrSW + 4]
mov  [arrW + 4], ax

A solution where you avoid using the xchg instruction doesn't pay on 8086, but is the way to go on x86 in general. eg. Next snippet is 10% slower on 8086:

mov  ax, [arrW]
mov  bx, [arrSW]
mov  [arrW], bx
mov  [arrSW], ax

A loop can't beat your current unrolled code, but if the arrays should become larger then next is what it could look like:

 mov  cx, 3
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
More:
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
 add  si, 2
 add  di, 2
 dec  cx
 jnz  More

In case the arrays arrW and arrSW follow each other in memory, the same loop is better written as:

 mov  bx, OFFSET arrW
More:
 mov  ax, [bx]
 mov  dx, [bx + 6]
 mov  [bx], dx
 mov  [bx + 6], ax
 add  bx, 2
 cmp  bx, OFFSET arrSW
 jb   More

If the CPU supports 32-bit registers, then working with these dwords can halve the number of iterations that are required. If the count of elements is odd, we peel off one word sized swap:

 mov  cx, 39
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
 shr  cx, 1
 jnc  More             ; Count was even
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
 add  si, 2
 add  di, 2
More:
 mov  eax, [si]
 mov  edx, [di]
 mov  [si], edx
 mov  [di], eax
 add  si, 4
 add  di, 4
 dec  cx
 jnz  More

The above code peels off one word sized swap at the start of the loop. As @PeterCordes wrote in his comment below this answer, it is generally beter to put the peeled-off swap at the end (for reasons of data alignment). Next is that version:

 mov  cx, 39
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
 shr  cx, 1            ; -> CF is set if count is odd
 jz   Next             \
More:                   |
 mov  eax, [si]         |
 mov  edx, [di]         |
 mov  [si], edx         |
 mov  [di], eax         | Nothing changes the CF
 lea  si, [si + 4]      |
 lea  di, [di + 4]      |
 dec  cx                |
 jnz  More              |
Next:                  /
 jnc  Done             ; (*) Count was even
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
Done: