Strange issue with my factorial program

I decided to learn x86 Assembly as my first serious programming language.

I decided to write a program that calculates the factorial of a given number.

The code works fine up until anything larger than 12!, after that I get incorrect results.

I suspect It is due to the result being larger than 32 bits. Correct?

To correct this I tried to rcl the edx register.

15! should be 1307674368000 but, it returns 27701857280.

.386
.model flat, stdcall
option casemap :none  

includelib \masm32\lib\msvcrt.lib
sprintf proto C :vararg
includelib \masm32\lib\user32.lib 
MessageBoxA proto :ptr,:ptr,:ptr,:DWORD
includelib \masm32\lib\kernel32.lib
ExitProcess proto :dword 

.data
   format db "%lld", 13, 10, 0
   _title db "Result",13,10,0

.code

main PROC
    LOCAL szBuf[9]:byte
xor edx,edx
rcl edx ,1
xor ebx,ebx



mov eax, 15 ; start of 15!
mov ebx,eax ; Prepares # of loop counter cycle


factoral:

dec ebx     ;loop counter
jz ready    ;when ebx = 0 jump to ready step
imul eax, ebx ; Multiply for intermeddiate result.
rcl edx, 1 ; Shift carry flag to edx to handle > 32 bit results.
jnz factoral ; Continue loop counter when ebx > 0




ready:  
    invoke sprintf, addr szBuf, offset format, eax, edx
    invoke MessageBoxA, 0, addr szBuf, offset _title, 0
    invoke ExitProcess, 0
main ENDP
END main

Extra: Would using shl eax, 1 to calculate the 2nd degree portion (n*2) for the intermediate be better than using imul for each and every degree.

Example: 5!

1) (5*4 =20)

2) (20*3 = 60)

3) (60 left bit shift 1 time = 120)

4) (120 * 1 = 120)

Solution

I suspect It is due to the result being larger than 32 bits. Correct?

Precisely. 12! == 479,001,600, which can be represented in 32 bits (as an unsigned quantity, but that's all in interpretation, not representation). However, 13! == 6,227,020,800, which overflows 32 bits. If you use a calculator that can show you a representation of the number in binary (Windows, macOS, and most Linux desktops have such a programmers' calculator built in), you'd see that the 64-bit representation has bit 32 set. Obviously it would overflow if you only had 32 bits in total!

Regarding your code, it isn't clear to me what you expect RCL to do here that is useful. This instruction is basically a rotation through the carry flag (CF). It shifts CF into the least-significant bit (LSB) while shifting the most-significant bit (MSB) into CF. The Intel architecture manuals have a pretty picture of this that may be more clear:

I can't see any way that this would help you to handle values larger than 32 bits. I mean, it is true that IMUL sets CF when the multiplication causes a bit to be carried into the upper half of the result, but the rotation isn't going to magically allow you to represent a 64-bit quantity in a 32-bit register. (If this rotation would have gotten you the right result, presumably Intel would have just done it as part of the multiplication?)

There is an instruction you can use to get a 64-bit product of a 32-bit multiplication. It also has the IMUL mnemonic, but it's the form that takes only one operand:

IMUL r/m32

This multiplies EAX (hard-coded) by the specified operand (r/m32, which means either a 32-bit register or a 32-bit value read from a memory location), putting the 64-bit result in EDX:EAX (also hard-coded). Note that the EDX:EAX notation means that the high-order bits are in EDX, and the low-order bits are in EAX. This is a standard convention for representing 64-bit values on 32-bit x86 architectures.

So, the simple fix to your code would be:

    mov  eax, 13      ; initial value
    mov  ecx, eax     ; loop counter

Factorial:
    dec  ecx          ; decrement counter
    jz   Finished     ; when counter == 0, we're done
    imul ecx          ; multiply by counter (EDX:EAX = EAX * ECX)
    jmp  Factorial    ; go back to top of loop

Finished:
    ...

Notice that I've used ECX for the counter, instead of EBX, because that's more idiomatic. It doesn't really matter which register you use, unless the instruction uses hard-coded registers like IMUL, but when it's available, it's common to use ECX for a counter. (That was its original purpose.) Also, when you start interoperating with C/C++ code, you'll need to pay attention to the calling convention, where EAX, ECX, and EDX are registers that your procedure can clobber, whereas you are expected to save and restore the original value of the other registers. That means avoiding EBX unless you absolutely need it saves you some code.

Also, you don't need to clear a register before initializing it. As such, code like:

xor ebx,ebx
...
mov ebx,eax ; Prepares # of loop counter cycle

is silly/unnecessary. Just do the MOVe.

Oh, and this code:

jnz factoral ; Continue loop counter when ebx > 0

never worked. You were trying to use the zero flag (ZF) set by the initial dec ebx, but the other intervening instructions clobber the flags, so you weren't reading the correct flag value. You'd have needed to do a comparison of EBX immediately before, to get the flags set.

Anyway, at the end of this code, you'll end up at Finished, and the factorial will be in EDX:EAX.

But, this will only work for 13!. After that, it will fail. Why? Because IMUL only uses EAX as its multiplicand, not EDX:EAX. The product of 13×12×11×10×9×8×7×6×5×4×3 fits fine in EAX, then that is multiplied by 2, the product of which fits in EDX:EAX. But if you had tried to do 15!, you'd overflow into EDX:EAX earlier, yet EDX would be ignored by subsequent multiplications.

Therefore, you need to get more clever and write code that actually does a full 64-bit multiplication—that is, multiplies a 64-bit multiplicand by a 32-bit multiplier to get a 64-bit product.

Luckily, that isn't difficult, especially since factorials are, by definition, taken only on non-negative values, so we don't need to worry about negative quantities. In other words, we just need to do an unsigned multiplication.

By the way, your printf format string should be "%llu", because the result should be interpreted as an unsigned quantity.

The code for this would be:

; EAX = divisor
; ECX = high bits of dividend
; EDX = low bits of dividend
imul  ecx, eax      ; multiply high bits of multiplicand by multiplier, quotient in ECX
mul   edx           ; multiply low bits of multiplicand by multiplier, quotient in EDX:EAX
add   edx, ecx      ; add high-order product to high bits of low-order product
; EDX:EAX = product

The wording of that last comment got a little hairy… Hopefully the code makes intuitive sense. All we do is break up the multiplication into two parts, operating on the 32-bit halves of the 64-bit value independently, and then add the results together.

Integrating this multiplication code into your original code, we get something like:

    ;push ebx         ; save EBX (only needed if complying with C calling convention)
    mov  eax, 15      ; initial value (low-order bits)
    xor  edx, edx     ; initial value's high-order bits are 0
    mov  ecx, eax     ; loop counter

Factorial:
    dec  ecx          ; decrement counter
    jz   Finished     ; when counter == 0, we're done
    mov  ebx, ecx     ; make copy of counter
    imul ebx, edx     ; high-order bits * multiplier
    mul  ecx          ; low-order bits * multiplier
    add  edx, ebx     ; add high-order product to high-order bits of low-order product
    jmp  Factorial    ; go back to top of loop

Finished:
    ;pop  ebx         ; restore EBX (only needed if complying with C calling convention)
    ...

And that works! At least, it works all the way up to 20!. At 21!, you get the wrong result because of our old friend overflow. 21! doesn't fit in a 64-bit value.

It also doesn't work for 0!—instead of the mathematically-defined result of 1, you get 0. You should be able to insert the necessary comparisons and branches to fix this problem yourself.

There are some ways to optimize this code further, but at the cost of introducing additional complexity, so make sure you understand this first!

One optimization that I already alluded to is making sure that you don't do a final multiplication by 1. This requires only inserting an additional comparison at the end of the loop body:

    ;push ebx         ; save EBX (only needed if complying with C calling convention)
    mov  eax, 15      ; initial value (low-order bits)
    xor  edx, edx     ; initial value's high-order bits are 0
    mov  ecx, eax     ; loop counter

Factorial:
    dec  ecx          ; decrement counter
    jz   Finished     ; when counter == 0, we're done
    mov  ebx, ecx     ; make copy of counter
    imul ebx, edx     ; high-order bits * multiplier
    mul  ecx          ; low-order bits * multiplier
    add  edx, ebx     ; add high-order product to high-order bits of low-order product
    cmp  ecx, 1
    jg   Factorial    ; keep looping as long as counter > 1

Finished:
    ;pop  ebx         ; restore EBX (only needed if complying with C calling convention)
    ...

You could improve on this slightly by hoisting the initial comparison out of the loop:

    ;push ebx         ; save EBX (only needed if complying with C calling convention)
    mov  eax, 15      ; initial value (low-order bits)
    xor  edx, edx     ; initial value's high-order bits are 0
    mov  ecx, eax     ; loop counter
    dec  ecx          ; decrement counter
    jz   Finished     ; when counter == 0, we're done, so skip the loop

Factorial:
    mov  ebx, ecx     ; make copy of counter
    imul ebx, edx     ; high-order bits * multiplier
    mul  ecx          ; low-order bits * multiplier
    add  edx, ebx     ; add high-order product to high-order bits of low-order product
    dec  ecx          ; decrement counter
    jg   Factorial    ; keep looping as long as counter > 1

Finished:
    ;pop  ebx         ; restore EBX (only needed if complying with C calling convention)
    ...

And that about does it with the easy optimizations. For other ideas, you can explore what C compilers emit for similar code, but beware that much of this code is non-trivial. (GCC 6.3's output looks a lot like my code, but GCC 7.1 unrolls the loop for more speed but resulting in code that is much more confusing and complicated to read/understand.) Besides that, also beware that C compilers don't necessary have perfect optimizers! It is often the case that an expert assembly programmer can write more optimal code than a compiler can generate (although they can't do it as quickly!).

Extra: Would using shl eax, 1 to calculate the 2nd degree portion (n*2) for the intermediate be better than using imul for each and every degree.

No.

First of all, you really don't ever want to write shl reg, 1 unless you actually need the carry flag to be set. A left-shift by 1 is equivalent to multiplying by two, which is equivalent to adding the value to itself. So, add reg, reg is simpler, better, and faster.

But still, even that would not be better in this case. While it is true that a simple shift or add is often faster than a multiplication (but not always—multiplications are faster than you might think), the only way you could use it here inside the loop is if you first checked to see that you were supposed to be multiplying by 2, and the cost of doing that check (more specifically, the cost of making the decision as a result of that check) is far more costly than a simple integer multiplication. Why? Because the decision requires a branch, which introduces the possibility of mispredictions. Even if you only had a misprediction in the event that the multiplier == 2, that would be more costly than the difference between IMUL and SHL/ADD.

In fact, though, we can do shl reg, x for every multiplication by a power of 2—would that be faster? No, and for the same reasons. Actually, worse, because it would increase the chances of mispredictions. The condition would alternate following a pattern unlikely to be understood by a branch-prediction algorithm, resulting in mispredictions more often than not.