How to convert a 64bit number to a string in x86 assembly?

I am looking for a way to convert 64bit number to a string (and possibly the other way around) using 32bit system. I'm not asking for code, just asking for some ideas.

Solution

The only hard part is dividing a 64bit number by 10 on a 32bit machine. Everything else is pretty much the same as the normal case where numbers fit in a single register.

Often you can look at gcc output for hints on how to do things in asm, but in this case it just calls the __udivdi3 libgcc helper function :/

If you're just doing this as a learning exercise, you should probably just look up an extended-precision div algorithm and use it. Here's one, from a book, using Intel syntax and 16bit operations. The variable names are clear, and there's explanatory text, so you should be able to re-implement it for 32bit. Search for "extended-precision division" for more hits, and / or look at the libgcc source code.
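As a rough illustration of what such an algorithm does, here is a minimal C sketch of schoolbook division of a 64bit value (held as two 32bit halves) by a small divisor. The names `u64_halves` and `divmod_small` are made up for this example and are not from the book or from libgcc. Each of the two steps corresponds to one x86 `div`: the remainder of the high-half division becomes the high dword (EDX) of the dividend for the low-half division.

```c
#include <stdint.h>

/* Hypothetical type for illustration: a 64-bit value as two 32-bit halves. */
typedef struct { uint32_t hi, lo; } u64_halves;

/* Divide *n by d in place and return the remainder.
   Assumes d != 0. */
static uint32_t divmod_small(u64_halves *n, uint32_t d)
{
    /* Step 1: divide the high half on its own (0:hi / d). */
    uint32_t q_hi = n->hi / d;
    uint32_t r    = n->hi % d;

    /* Step 2: divide r:lo by d. Because r < d, the quotient is
       guaranteed to fit in 32 bits, so on x86 this is one div
       with EDX = r, EAX = lo, and it cannot fault. */
    uint64_t part = ((uint64_t)r << 32) | n->lo;

    n->hi = q_hi;
    n->lo = (uint32_t)(part / d);
    return (uint32_t)(part % d);
}
```

Calling `divmod_small(&n, 10)` repeatedly produces the decimal digits from least significant to most significant.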

If you're implementing this for real (for high performance):

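Remember that x86's div instruction does a 64b/32b -> 32b division (but faults if the quotient overflows a 32bit register). The quotient fits exactly when the high dword (EDX) is smaller than the divisor, so you can check whether the high dword of your number is below 10, and if so you only need a single division for that step. The remainder of each division by 10 is the next decimal digit, starting from the least significant one.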

As soon as your number is small enough to divide with a single div, break out of the extended-precision loop and use a single div per digit.
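The break-out strategy might look something like the following C sketch. It is only an illustration under some assumptions: the function name `u64_to_dec` and the 21-byte buffer convention are invented here, and the 64bit expressions stand in for what would be a single `div` with EDX:EAX as the dividend in assembly.

```c
#include <stdint.h>

/* Hypothetical helper: writes the decimal digits of hi:lo into buf
   (which must hold at least 21 bytes) and returns a pointer to the
   first digit. Digits are produced least-significant first, so the
   buffer is filled from the end backwards. */
static char *u64_to_dec(uint32_t hi, uint32_t lo, char buf[21])
{
    char *p = buf + 20;
    *p = '\0';

    /* Phase 1: extended-precision division, two chained divs per digit,
       while the quotient of a single div would overflow 32 bits. */
    while (hi >= 10) {
        uint32_t q_hi = hi / 10;
        uint64_t part = ((uint64_t)(hi % 10) << 32) | lo;
        hi = q_hi;
        lo = (uint32_t)(part / 10);
        *--p = (char)('0' + (uint32_t)(part % 10));
    }

    /* Phase 2: hi < 10, so one div of hi:lo by 10 cannot overflow
       and leaves a value that fits in 32 bits. */
    if (hi != 0) {
        uint64_t part = ((uint64_t)hi << 32) | lo;
        lo = (uint32_t)(part / 10);
        *--p = (char)('0' + (uint32_t)(part % 10));
    }

    /* Phase 3: ordinary single-register loop, one division per digit. */
    do {
        *--p = (char)('0' + lo % 10);
        lo /= 10;
    } while (lo != 0);

    return p;
}
```

The returned pointer lands somewhere inside `buf`, so the caller can print it directly or copy it to the front.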

How many extended-precision iterations that takes depends on the high dword: none if it is already below 10, up to about nine for a full 64bit value. Once the number fits in 32 bits, you can divide by 10 using the multiplicative inverse:

// from the godbolt link: gcc5.3 -O3 -m32
uint32_t div10_u32(uint32_t x) { return x/10; }
    movl    $-858993459, %edx     # 0xcccccccd
    movl    %edx, %eax            # gcc is dumb: no need for this mov.  clang avoids it
    mull    4(%esp)
    movl    %edx, %eax
    shrl    $3, %eax
    ret

Note how this uses the high half of the result of a full-multiply (32bx32b->64b).
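For reference, the same computation can be spelled out in C so the role of the high half is visible. This is a sketch of what the gcc output above does, with a made-up function name `div10_mulhi`; the constant 0xCCCCCCCD and the shift by 3 are taken from the asm listing.

```c
#include <stdint.h>

/* Divide by 10 without a div instruction: multiply by 0xCCCCCCCD
   (roughly 2^35 / 10, rounded up), keep the high 32 bits of the
   64-bit product, then shift right by 3 more bits. */
static uint32_t div10_mulhi(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;   /* mull: 32x32 -> 64 */
    uint32_t high    = (uint32_t)(product >> 32);   /* EDX after mull */
    return high >> 3;                               /* shrl $3 */
}
```

The digit itself is then `x - 10 * div10_mulhi(x)`, which costs one more multiply and a subtract.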

It might be faster to do the whole thing using multiplicative inverses, even though that means doing a 64 x 64b -> 128b multiply on a 32bit machine. Integer division is very slow, and barely pipelined, but integer mul is very fast on Intel CPUs.
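To give an idea of what the all-multiply approach involves, here is a hedged C sketch that builds the high half of a 64x64 -> 128b product out of four 32x32 -> 64b multiplies (each of which is one `mul` on 32bit x86). The function names `mulhi_u64` and `div10_u64` are invented for this example; the constant 0xCCCCCCCCCCCCCCCD with a shift of 3 is the 64bit analogue of the 32bit constant shown earlier. Whether this actually beats the div-based loop would need measuring.

```c
#include <stdint.h>

/* High 64 bits of a 64x64 -> 128-bit product, using only
   32x32 -> 64-bit partial products. */
static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t p00 = (uint64_t)a0 * b0;
    uint64_t p01 = (uint64_t)a0 * b1;
    uint64_t p10 = (uint64_t)a1 * b0;
    uint64_t p11 = (uint64_t)a1 * b1;

    /* Sum the contributions to bits 32..63 to find the carry into
       the high half. Three values below 2^32 cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* 64-bit divide by 10 via multiplicative inverse (no div at all). */
static uint64_t div10_u64(uint64_t x)
{
    return mulhi_u64(x, 0xCCCCCCCCCCCCCCCDULL) >> 3;
}
```

In assembly the `uint64_t` values would each live in a register pair, and the additions become add/adc chains.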

AVX512-DQ adds a 64x64 -> 64b low multiply instruction, but that doesn't help for extended precision, which needs the high half. AVX512-IFMA adds 52bx52b low and high multiply instructions, so in a few years it might be worth having a code-path for that (32bit binaries running on hardware with AVX512-IFMA), for when the top 64-52 = 12 bits of your number are all zero.