I'm looking to reverse a string in the shortest amount of assembly code possible.
I can only use SSSE3 or older extensions because Unicorn doesn't support anything newer. I've tried using the ymm and zmm registers, but it breaks every time.
Even though the SSSE3 instructions are more concise, the pshufb control vector for byte-reversing a 128-bit XMM register still takes up 16 bytes, which makes that version even longer. I'm open to any ideas, but the following are my best attempts.
I need 32 bytes or less, and the smaller the better. The best I've got so far is 42 bytes, but that assumes the string at rdx (or ecx if using x86) is exactly 30 characters long.
Ideally, it would be able to dynamically get the size by checking for a null terminator.
The string address is sitting inside of rdx (or ecx if using x86).
Additional restriction: no use of stack space. This block of code has to run without RSP pointing to usable stack memory.
Standard x86-64 - 42 bytes
; get values in registers
mov rax, [rdx]
mov rbx, [rdx + 8]
mov rcx, [rdx + 16]
mov r8, [rdx + 24]
; swap bytes around
bswap rax
bswap rbx
bswap rcx
bswap r8
; shift right by 2 bytes (16 bits) to drop the two trailing nulls
sar r8, 16
; put it back
mov [rdx], r8
mov [rdx + 0x6], rcx
mov [rdx + 0xE], rbx
mov [rdx + 0x16], rax
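To make the offsets in the listing above easier to follow, here is a rough C model of the same idea. It is only a sketch: `bswap64` and `reverse30` are hypothetical names, and it assumes a little-endian host (as on x86) and a 32-byte buffer holding exactly 30 characters followed by two null bytes.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 64-bit value, like the bswap instruction. */
uint64_t bswap64(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xFF);
        v >>= 8;
    }
    return r;
}

/* Hypothetical model of the listing: buf holds 30 chars + 2 null bytes.
   Assumes a little-endian host so memcpy matches the mov loads/stores. */
void reverse30(char *buf) {
    uint64_t q[4];
    memcpy(q, buf, 32);            /* the four 8-byte loads */
    for (int i = 0; i < 4; i++)
        q[i] = bswap64(q[i]);      /* reverse the bytes inside each qword */
    q[3] >>= 16;                   /* the two nulls are now the low bytes; drop them */
    memcpy(buf, &q[3], 8);         /* chars 29..24, then two filler bytes */
    memcpy(buf + 6, &q[2], 8);     /* overwrites the filler with chars 23..16 */
    memcpy(buf + 14, &q[1], 8);    /* chars 15..8 */
    memcpy(buf + 22, &q[0], 8);    /* chars 7..0; bytes 30 and 31 stay null */
}
```

The last qword is stored first so that its two filler bytes get overwritten by the next store, which is why the stores land at offsets 0, 6, 0xE and 0x16 rather than at multiples of 8.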
SSSE3 - 62 bytes (because of the byte array, otherwise it's 46)
movdqu xmm3, [rip + 0x27]
movdqu xmm0, [rdx]
movdqu xmm1, [rdx + 0x10]
pshufb xmm0,xmm3
pshufb xmm1,xmm3
movdqu [rdx], xmm1
movdqu xmm1, [rdx+0x2]
movdqu [rdx], xmm1
movdqu [rdx+0xE], xmm0
hlt
; this would be tacked on to the end of the assembly as the rip + 0x27 value
\x00\x0F\x0E\x0D\x0C\x0B\x0A\x09\x08\x07\x06\x05\x04\x03\x02\x01
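For anyone not familiar with pshufb, a plain-C model of what it does to one XMM register may help show why a 16-byte constant is needed at all. This is a sketch, not the real instruction: `pshufb_model` is a hypothetical name, and the usage note assumes the plain descending control vector 0x0F down to 0x00, which is an illustration rather than the exact bytes from the listing.

```c
#include <stdint.h>

/* Model of pshufb on a 128-bit register: destination byte i takes the
   source byte selected by the low 4 bits of control byte i, or becomes
   zero when bit 7 of that control byte is set. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctl[16]) {
    uint8_t out[16];               /* temporary, so dst may alias src */
    for (int i = 0; i < 16; i++)
        out[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = out[i];
}
```

With a control vector whose byte i is 15 - i, destination byte 0 receives source byte 15 and so on, which reverses the 16 bytes in a single instruction. The price is that the control vector itself has to live somewhere in the code.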
The following 31 bytes of x86-64 assembler code for void strrev(char* p) will reverse a string of any length (including the empty string) in place, using nothing but the base instruction set.
However, the routine requires the pointer to the string in register rdi (in agreement with the System V ABI), not rdx. A mov rdi, rdx would cost 3 bytes. Also, due to the use of two implicitly locked xchg instructions, performance is going to be awful.
The small size is in part due to creative use of the single-byte stosb/lodsb instructions: besides writing or reading one byte, they increment or decrement rdi and rsi respectively, depending on the Direction Flag, which can be set and cleared by means of the single-byte instructions std/cld.
If the code were x86-32, or could limit itself to strings shorter than 4 GB, a few more bytes could be saved.
0000000000000000 <strrev>:
   0:   31 c0          xor    eax,eax
   2:   48 8d 48 ff    lea    rcx,[rax-0x1]
   6:   48 89 fe       mov    rsi,rdi
   9:   f2 ae          repnz scas al,BYTE PTR es:[rdi]
   b:   48 83 ef 02    sub    rdi,0x2
   f:   48 39 f7       cmp    rdi,rsi
  12:   7e 0a          jle    1e <strrev+0x1e>
  14:   86 07          xchg   BYTE PTR [rdi],al
  16:   86 06          xchg   BYTE PTR [rsi],al
  18:   fd             std
  19:   aa             stos   BYTE PTR es:[rdi],al
  1a:   fc             cld
  1b:   ac             lods   al,BYTE PTR ds:[rsi]
  1c:   eb f1          jmp    f <strrev+0xf>
  1e:   c3             ret
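The data flow through al is a little unusual, so here is a rough C rendering of the loop. It is a sketch under assumptions: `strrev_model` is a hypothetical name, `strlen` stands in for the repnz scas scan, and indices replace the raw pointers so that the empty-string case stays well defined in C.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C model of the 31-byte routine. lo plays the role of rsi,
   hi the role of rdi, and al is carried between iterations as in the asm. */
void strrev_model(char *p) {
    ptrdiff_t lo = 0;                          /* mov rsi, rdi */
    ptrdiff_t hi = (ptrdiff_t)strlen(p) - 1;   /* repnz scas, then sub rdi, 2 */
    char al = 0;                               /* xor eax, eax */
    while (hi > lo) {                          /* cmp rdi, rsi / jle */
        char t;
        t = p[hi]; p[hi] = al; al = t;         /* xchg [rdi], al: al = back char */
        t = p[lo]; p[lo] = al; al = t;         /* xchg [rsi], al: front gets back char */
        p[hi--] = al;                          /* std / stos: back gets front char */
        al = p[lo++];                          /* cld / lods: advance front cursor */
    }
}
```

The first xchg leaves a stale value at the back position for a moment, but the stos a couple of instructions later overwrites it, so whatever al holds at the start of an iteration never survives into the result.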