I am looking for a way to convert a 64bit number to a string (and possibly the other way around) on a 32bit system. I'm not asking for code, just for some ideas.
The only hard part is dividing a 64bit number by 10 on a 32bit machine. Everything else is pretty much the same as the normal case where numbers fit in a single register.
Often you can look at gcc output for hints on how to do things in asm, but in this case it just calls the __udivdi3 libgcc helper function :/
If you're just doing this as a learning exercise, then you should probably just look up an extended-precision div algorithm and use it. Here's one, from a book, using Intel syntax and 16bit operations. The variable names are clear, and there's explanatory text, so you should be able to re-implement it for 32bit. Google on that phrase for more hits, and / or look at the libgcc source code.
See also implementing school-like division on 32bit chunks on x86
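As a rough illustration of the school-style approach, here is a minimal C sketch that divides a 64bit value (held as two 32bit halves) by 10 in place. It works on 16bit chunks so that every intermediate value fits in 32 bits. The function name `div10_u64_chunks` and the hi/lo calling convention are made up for this example; this is not the code from the book or from libgcc.

```c
#include <stdint.h>

/* Hypothetical helper: divides the 64bit value hi:lo by 10 in place and
 * returns the remainder (the lowest decimal digit).
 * School division, most significant chunk first, on 16bit chunks. */
static uint32_t div10_u64_chunks(uint32_t *hi, uint32_t *lo)
{
    uint32_t chunks[4] = {
        *hi >> 16, *hi & 0xFFFFu,
        *lo >> 16, *lo & 0xFFFFu
    };
    uint32_t rem = 0;

    for (int i = 0; i < 4; i++) {
        /* rem < 10, so cur < 10 * 65536 and fits easily in 32 bits */
        uint32_t cur = (rem << 16) | chunks[i];
        chunks[i] = cur / 10;   /* quotient chunk is < 65536 */
        rem = cur % 10;         /* carried into the next chunk */
    }

    *hi = (chunks[0] << 16) | chunks[1];
    *lo = (chunks[2] << 16) | chunks[3];
    return rem;
}
```

Calling this repeatedly until both halves are zero yields the decimal digits from least to most significant.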
If you're implementing this for real (for high performance):
Remember that x86's div instruction does a 64b/32b -> 32b division (but faults if the quotient overflows a 32bit register). So you could check if your high dword is small enough (smaller than the divisor, so the quotient fits in 32 bits), and if so you only need a single division for the first step to get the high digit.
As soon as your number is small enough to divide with a single div, break out of the extended-precision loop and use a single div per digit.
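To make the phases concrete, here is a hedged C sketch of the whole conversion. The name `u64_to_dec` and the hi/lo parameters are invented for this example. In C the 64b/32b step has to be written with a `uint64_t`, and a compiler targeting 32bit may still call its helper for it; in hand-written asm each marked step would be one div, because the high half of the dividend is always smaller than 10.

```c
#include <stdint.h>

/* Hypothetical helper: writes the decimal form of hi:lo into buf
 * (at least 21 bytes) and returns a pointer to the first digit. */
static char *u64_to_dec(uint32_t hi, uint32_t lo, char buf[21])
{
    char *p = buf + 20;
    *p = '\0';

    /* Extended-precision phase: only while the value needs more than 32 bits. */
    while (hi != 0) {
        uint32_t rem = hi;
        if (hi >= 10) {         /* needs two divisions: high dword first */
            rem = hi % 10;
            hi /= 10;
        } else {                /* quotient fits in 32 bits: one division */
            hi = 0;
        }
        /* rem < 10, so this is a 64b/32b -> 32b division that cannot overflow */
        uint64_t cur = ((uint64_t)rem << 32) | lo;
        lo = (uint32_t)(cur / 10);
        *--p = (char)('0' + (uint32_t)(cur % 10));
    }

    /* Plain 32bit phase: one division per digit. */
    do {
        *--p = (char)('0' + lo % 10);
        lo /= 10;
    } while (lo != 0);

    return p;
}
```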
That probably only takes one iteration to reduce down to a 32bit number. At that point you can divide by 10 using the multiplicative inverse:
// from the godbolt link: gcc5.3 -O3 -m32
uint32_t div10_u32(uint32_t x) { return x/10; }
movl $-858993459, %edx # 0xcccccccd
movl %edx, %eax # gcc is dumb: no need for this mov. clang avoids it
mull 4(%esp)
movl %edx, %eax
shrl $3, %eax
ret
Note how this uses the high half of the result of a full-multiply (32bx32b->64b).
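The same trick can be written out by hand in C, which makes the constant and the shift explicit. This is a small sketch; the names `div10_mulhi` and `mod10_mulhi` are made up here. The constant 0xCCCCCCCD is 2^35 / 10 rounded up, so taking the high half of the product (a shift by 32) and then shifting by 3 is the same as shifting the full 64bit product by 35.

```c
#include <stdint.h>

/* Hypothetical helper: x / 10 without a division instruction. */
static uint32_t div10_mulhi(uint32_t x)
{
    uint64_t prod = (uint64_t)x * 0xCCCCCCCDu;  /* 32b x 32b -> 64b */
    return (uint32_t)(prod >> 35);              /* high half, then >> 3 */
}

/* Hypothetical helper: x % 10, recovered from the quotient. */
static uint32_t mod10_mulhi(uint32_t x)
{
    return x - div10_mulhi(x) * 10u;
}
```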
It might be faster to do the whole thing using multiplicative inverses, even though that means doing a 64 x 64b -> 128b multiply on a 32bit machine. Integer division is very slow, and barely pipelined, but integer mul is very fast on Intel CPUs.
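For a sense of what that costs, here is a sketch of a 64bit divide-by-10 built only from 32b x 32b -> 64b multiplies. The function names are invented for this example. The constant 0xCCCCCCCCCCCCCCCD with a shift of 3 on the high half is the 64bit counterpart of the 32bit constant; the `uint64_t` additions would become add/adc pairs on a 32bit machine.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128bit product a * b,
 * assembled from four 32b x 32b -> 64b partial products. */
static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t ll = (uint64_t)a_lo * b_lo;
    uint64_t lh = (uint64_t)a_lo * b_hi;
    uint64_t hl = (uint64_t)a_hi * b_lo;
    uint64_t hh = (uint64_t)a_hi * b_hi;

    /* Sum the pieces that land in bits 32..63; the carry out of this
     * sum (at most 2) belongs to the high half. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Hypothetical helper: x / 10 via the multiplicative inverse. */
static uint64_t div10_u64_mulhi(uint64_t x)
{
    return mulhi_u64(x, 0xCCCCCCCCCCCCCCCDull) >> 3;
}
```

Since only the high half of the product is needed, the low 32 bits of `ll` are never used, but all four multiplies still have to be done.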
AVX512-DQ adds a 64x64 -> 64b low multiply instruction, but that doesn't help for extended precision. AVX512-IFMA adds 52bx52b low and high multiply instructions, so in a few years it might be worth having a code-path for that (32bit binaries running on hardware with AVX512-IFMA), when the top 64-52 bits of your number are all zero.