Multiplying two 32-bit numbers and printing the 64-bit result as decimal in NASM assembly

I have a problem using NASM assembly.

I can't figure out how to multiply two numbers and print the result to the screen.

The problem is that we are only allowed to use a function that prints 32-bit numbers, not 64-bit numbers.

So my problem is probably with the math. I think I need to use Horner's method to get the decimal number, as I show below.

If I have

AF / A = 11 remaining 5 
11 / A = 1 remaining 7
1 / A = 0 remaining 1

-> 175 which is the right result

But when I split it across two registers (4 bits each here, just as an example; in my code they are 4 bytes each):

A | F    A / A = 1 remaining 0 and F / A = 1 remaining 5
         1 / A = 0 remaining 1

->150 which is wrong

Here is my assembly code

mov eax, [Zahl1]
mul dword [Zahl2]
mov [High], edx


;---- low-----
mov ebx, 10
loopbegin:
;dividing by 10
xor edx, edx
div ebx

;counting
inc dword [counter]

;saving the number 
push edx
cmp eax, 0
jne loopbegin

mov ebx, 10
; --- high ----
mov eax, [High]
highloop:
xor edx, edx
div ebx

inc dword [counter]

push edx
cmp eax, 0
jne highloop

; (here follows the loop that prints the digits from the stack)

Solution

You can't just convert and print the two halves separately, because the high half contributes hi * 4294967296 (that is, hi * 2^32) to the whole 64-bit number.

4294967296 is not a power of 10, so bits in the high half affect the low decimal digits. If you were printing in a power-of-2 base, like hex or octal, your method would work because division by the radix would just be a shift: i.e. the low hex digit is determined by just the low 4 bits. But the low decimal digit depends on all 64 binary bits.

Instead, you need to do 64-bit division by 10. This takes multiple div instructions, because div r32 (64b / 32b => 32b) faults if the quotient overflows 32 bits. See "Assembler 64b division" for working code for extended-precision division. (But don't use xchg with memory, which carries an implicit lock prefix and is slow; use some extra registers instead.)
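As a minimal sketch of that two-step division (the register assignment and the label div64_by_10 are my own choices for illustration, not taken from the linked answer): divide the high half first, then leave its remainder in edx as the upper part of the dividend for the low half. The second div can't fault, because edx then holds a remainder that is smaller than the divisor.

```nasm
section .text

; div64_by_10 (hypothetical helper name)
; in:  edi:esi = 64-bit dividend (edi = high half, esi = low half)
; out: edi:esi = 64-bit quotient, edx = remainder (0..9)
; clobbers eax, ecx
div64_by_10:
    mov  ecx, 10
    xor  edx, edx        ; edx:eax = 0:hi
    mov  eax, edi
    div  ecx             ; eax = hi / 10, edx = hi % 10
    mov  edi, eax        ; high half of the quotient
    mov  eax, esi        ; edx:eax = (hi % 10):lo
    div  ecx             ; eax = low half of the quotient, edx = remainder
    mov  esi, eax
    ret
```

Applied to the question's AF example with 4-bit halves, carrying the remainder of the high half into the low half gives the right digits:

A | F    A / A = 1 remaining 0, then 0F / A = 1 remaining 5   -> quotient 1 | 1
1 | 1    1 / A = 0 remaining 1, then 11 / A = 1 remaining 7   -> quotient 0 | 1
0 | 1    0 / A = 0 remaining 0, then 01 / A = 0 remaining 1   -> quotient 0 | 0

-> 175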

(div is slow and mul is very fast on modern CPUs; it might be worth doing extended-precision multiply to get the high half of a 64b * 64b => 128b multiply for a fixed-point multiplicative inverse to divide by 10 faster.)

Also, you don't need to push the digits, and you don't need a counter in memory. Just use an extra register as a pointer that starts at the end of a buffer and moves backwards as each digit is stored. See "How do I print an integer in Assembly Level Programming without printf from the c library?" for how to write the surrounding code; just replace the 32-bit division in the inner loop with extended-precision division using two div instructions.
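Putting the pieces together, a sketch of the whole program might look like this. It assumes 32-bit Linux (the write and exit system calls via int 0x80) in place of the question's 32-bit print function. The buffer name buf and the sample values in Zahl1 and Zahl2 are made up for illustration.

```nasm
; Assumed build on Linux:
;   nasm -f elf32 mul64.asm && ld -m elf_i386 -o mul64 mul64.o
global _start

section .data
Zahl1:  dd 4000000000           ; sample inputs (illustrative values)
Zahl2:  dd 3000000000

section .bss
buf:    resb 21                 ; up to 20 decimal digits plus a newline

section .text
_start:
    mov  eax, [Zahl1]
    mul  dword [Zahl2]          ; edx:eax = 64-bit product
    mov  esi, eax               ; esi = low half
    mov  edi, edx               ; edi = high half

    mov  ebx, buf + 20          ; ebx -> last byte of the buffer
    mov  byte [ebx], 10         ; trailing newline
    mov  ecx, 10                ; divisor
.digit:
    xor  edx, edx
    mov  eax, edi
    div  ecx                    ; eax = hi / 10, edx = hi % 10
    mov  edi, eax               ; new high half
    mov  eax, esi
    div  ecx                    ; eax = new low half, edx = next digit
    mov  esi, eax
    add  dl, '0'                ; digit value -> ASCII
    dec  ebx
    mov  [ebx], dl              ; store digits from the end backwards
    mov  eax, esi
    or   eax, edi               ; loop until the whole quotient is zero
    jnz  .digit

    mov  edx, buf + 21
    sub  edx, ebx               ; length = end of buffer - first digit
    mov  ecx, ebx               ; pointer to the first digit
    mov  ebx, 1                 ; fd 1 = stdout
    mov  eax, 4                 ; sys_write
    int  0x80

    mov  eax, 1                 ; sys_exit
    xor  ebx, ebx               ; exit status 0
    int  0x80
```

With these sample values it should print 12000000000000000000. Because the digits are stored from the end of the buffer backwards, they are already in printing order, so no stack or counter is needed.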