I wrote a printint function in 64-bit NASM that prints an integer to STDOUT. It's really slow though, and after doing some benchmarks I determined that converting an integer to a string is the slowest part by far.
My current strategy for converting ints to strings is to divide by 10 until the number is 0, storing each remainder as an ASCII digit from the end of a buffer backwards.
I've tried Googling for how other people do it, and it's more or less the same as what I do.
Here's the relevant code:
printint:                       ; num in edi
    push rbp                    ; save the caller's base pointer
    mov rbp, rsp                ; set up our own stack frame
    sub rsp, 32                 ; reserve scratch space for the digit buffer (keeps rsp 16-byte aligned)
    cmp edi, 0                  ; compare num to 0
    je _printint_zero           ; 0 is a special case
    jg _printint_pos            ; flags still hold the compare: no sign if positive
    ; print a negative sign (code not relevant)
    neg edi                     ; convert into a positive integer
_printint_pos:
    ; note: rbx is callee-saved in the System V ABI; save it if callers rely on that
    lea rbx, [rsp+17]           ; rbx points to the last byte of the buffer
    mov qword [rsp+8], 0        ; clear the buffer
    mov word [rsp+16], 0        ; 10 bytes from [8,18)
_printint_loop:
    cmp edi, 0                  ; compare edi to 0
    je _printint_done           ; if edi == 0 then we are done
    xor edx, edx                ; zero the high half of the dividend edx:eax
    mov eax, edi
    mov ecx, 10
    div ecx                     ; eax = quotient, edx = remainder
    mov edi, eax                ; move quotient back to edi
    add dl, '0'                 ; convert remainder to ASCII
    mov byte [rbx], dl          ; store the digit in the buffer
    dec rbx                     ; move 1 position to the left in the buffer
    jmp _printint_loop
_printint_done:
    ; print the buffer (code not relevant)
    mov rsp, rbp                ; restore stack and base pointers
    pop rbp
    ret
How can I optimize it so that it can run much faster? Alternatively, is there a significantly better method to convert an integer to a string?
I do not want to use printf or any other function in the C standard library.
Turns out I was wrong about the source of the bottleneck. My benchmark was flawed. Although micro-optimizations such as magic number multiplication and better loops did help, the biggest bottleneck was the syscalls.
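For readers unfamiliar with the magic number multiplication mentioned above: it replaces the div by 10 with a multiply and a shift. The following is only a minimal sketch, assuming x86-64 and an unsigned 32-bit value. The routine name u32_to_dec and its register interface are made up for illustration and are not the code from this post.

```nasm
; Sketch: unsigned 32-bit integer to decimal digits without `div`.
; Hypothetical routine; assumes x86-64.
; in:  edi = value
;      rsi = pointer one past the end of a buffer of at least 10 bytes
; out: rax = pointer to the first digit (digits end at the original rsi)
; clobbers: rcx, rdx, r8, r9
u32_to_dec:
    mov     eax, edi                ; eax = remaining value
    mov     r8, rsi                 ; r8 = write cursor, moves backwards
    mov     r9d, 0xCCCCCCCD         ; ceil(2^35 / 10)
.next_digit:
    mov     ecx, eax                ; keep a copy of the current value
    mul     r9d                     ; edx:eax = value * 0xCCCCCCCD
    shr     edx, 3                  ; edx = (product >> 35) = value / 10
    lea     eax, [rdx + rdx*4]      ; eax = quotient * 5
    add     eax, eax                ; eax = quotient * 10
    sub     ecx, eax                ; ecx = value % 10
    add     ecx, '0'                ; convert the digit to ASCII
    dec     r8
    mov     [r8], cl                ; store it, filling the buffer right to left
    mov     eax, edx                ; continue with the quotient
    test    eax, eax
    jnz     .next_digit             ; do-while: 0 still produces one digit
    mov     rax, r8
    ret
```

The constant works because 0xCCCCCCCD is 2^35 / 10 rounded up, so the top bits of the 64-bit product give the exact quotient for every 32-bit input. Testing at the bottom of the loop also removes the need for a separate zero case.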
By using buffered reading & writing (buffer size of 16 kB), I was able to achieve my goal of reading and printing integers faster than scanf and printf.
Creating an output buffer sped up one particular benchmark by over 4x, whereas the micro-optimizations sped it up by about 25%.
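The output buffering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the benchmark: it assumes Linux x86-64 (where write is syscall number 1), and the names out_buf, out_len, out_put and out_flush are hypothetical.

```nasm
; Sketch of a 16 KiB output buffer. Assumes Linux x86-64.
; Error handling and partial writes are deliberately ignored here.
BUF_SIZE equ 16384

section .bss
out_buf: resb BUF_SIZE
out_len: resq 1                     ; number of bytes currently buffered

section .text

; out_put: append rdx bytes starting at rsi to the buffer.
; Assumes rdx <= BUF_SIZE and that the direction flag is clear.
; clobbers: rax, rcx, rsi, rdi, r11
out_put:
    mov     rax, [rel out_len]
    lea     rcx, [rax + rdx]        ; length after appending
    cmp     rcx, BUF_SIZE
    jbe     .copy                   ; it fits: no syscall needed
    push    rsi
    push    rdx
    call    out_flush               ; buffer would overflow: write it out first
    pop     rdx
    pop     rsi
    xor     eax, eax                ; buffer is empty again
.copy:
    lea     rdi, [rel out_buf]
    add     rdi, rax                ; destination = out_buf + out_len
    mov     rcx, rdx
    rep movsb                       ; copy rdx bytes from [rsi] to [rdi]
    add     rax, rdx
    mov     [rel out_len], rax      ; record the new length
    ret

; out_flush: write everything buffered so far to stdout in one syscall.
; clobbers: rax, rcx, rdx, rsi, rdi, r11
out_flush:
    mov     rdx, [rel out_len]
    test    rdx, rdx
    jz      .done                   ; nothing to write
    mov     eax, 1                  ; sys_write
    mov     edi, 1                  ; fd 1 = stdout
    lea     rsi, [rel out_buf]
    syscall
    mov     qword [rel out_len], 0
.done:
    ret
```

With this shape, printint would call out_put instead of issuing a write for every number, and the program would call out_flush once before exiting, so thousands of small writes collapse into a handful of syscalls.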
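For anyone who stumbles across this post in the future, the optimizations I made are the ones described above: magic number multiplication instead of div, a better conversion loop, and buffered reading and writing.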
Another potential improvement I could make (but didn't) is dividing by a larger base and using a lookup table, as mentioned by phuclv in the comments.
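As a rough illustration of that idea (not something that was benchmarked for this post), the sketch below divides by 100 and looks up two ASCII digits at a time in a 200-byte table, which roughly halves the number of loop iterations. It assumes x86-64 and an unsigned 32-bit value; the names digit_pairs, pair_idx and u32_to_dec_pairs are hypothetical.

```nasm
section .rodata
; "00", "01", ..., "99": two ASCII digits per entry, tens digit first.
digit_pairs:
%assign pair_idx 0
%rep 100
    db '0' + pair_idx / 10, '0' + pair_idx - (pair_idx / 10) * 10
%assign pair_idx pair_idx + 1
%endrep

section .text

; in:  edi = value
;      rsi = pointer one past the end of a buffer of at least 10 bytes
; out: rax = pointer to the first digit (digits end at the original rsi)
; clobbers: rcx, rdx, r8, r9
u32_to_dec_pairs:
    mov     eax, edi                    ; rax = remaining value (upper half zeroed)
    mov     r8, rsi                     ; r8 = write cursor, moves backwards
    lea     r9, [rel digit_pairs]
.pair_loop:
    cmp     eax, 100
    jb      .tail                       ; fewer than three digits left
    mov     ecx, eax                    ; ecx = value
    imul    rax, rax, 1374389535        ; ceil(2^37 / 100)
    shr     rax, 37                     ; rax = value / 100
    imul    edx, eax, 100
    sub     ecx, edx                    ; ecx = value % 100
    movzx   edx, word [r9 + rcx*2]      ; load both ASCII digits at once
    sub     r8, 2
    mov     [r8], dx                    ; low byte (tens digit) lands first
    jmp     .pair_loop
.tail:                                  ; here eax < 100
    cmp     eax, 10
    jb      .one_digit
    movzx   edx, word [r9 + rax*2]      ; exactly two digits left
    sub     r8, 2
    mov     [r8], dx
    mov     rax, r8
    ret
.one_digit:
    add     eax, '0'                    ; a single leading digit (or just "0")
    dec     r8
    mov     [r8], al
    mov     rax, r8
    ret
```

The table costs 200 bytes of read-only data. Whether the shorter loop actually beats the one-digit version depends on the input sizes and the CPU, so it is worth benchmarking before adopting.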