Factorial calculation in Motorola 68k assembly using Easy68K

I'm writing an assembly program for the Motorola 68k processor using Easy68K to compute the factorial of an integer $N$.

My current implementation:

       ORG $8000
       
num DC.W 4 ; e.g. I want to calculate the factorial of 4
result DC.L 0 ; Space to store the result

START: 
      move.w num,d0 ; d0 = 4
      ble.s ZorN
      move.w d0,d1 ;also d1 = 4
      subq.w #1,d0 ; d0 = d0 - 1 = 4 - 1 = 3
LOOP: 
     mulu.w d0,d1 ; 1st cycle: d1 = d1 * d0 = 4 * 3 = 12 [d1 is now 12].
                  ; 2nd cycle: d1 = d1 * d0 = 12 * 2 = 24 [d1 is now 24].
                  ; 3rd cycle: d1 = d1 * d0 = 24 * 1 = 24 [d1 is now 24].
                 
                  
             
     subq.w #1,d0 ; 1st cycle: d0 = d0 - 1 = 3 - 1 = 2 [d0 is now 2].
                  ; 2nd cycle: d0 = d0 - 1 = 2 - 1 = 1 [d0 is now 1].
                  ; 3rd cycle: d0 = d0 - 1 = 1 - 1 = 0 [d0 is now 0].
                  
     
     bne.s LOOP   ; Repeat the loop if d0 is not zero
     move.l d1,result 
     bra.s END    
     
 
ZorN:
     blt.s END
     moveq.l #1,d1 ;0! = 1 
     move.l d1,result 
  
END:
    SIMHALT  
    END START

My Questions:

Is my implementation correct for computing N! (factorial)?
How can I optimize the code?
- I suspect I'm using too many jumps (beq.s, blt.s, bra.s), and maybe there's a cleaner way to structure the loop.

Any suggestions for improving efficiency and readability would be greatly appreciated :-)!

Solution

The implementation seems correct, but you are still doing that one multiplication too many that Erik Eidt told you about!
I would certainly also return a defined result in case of a negative input.

First you start with an elimination process for the negative and very small numbers.
Then follow that by a multiplications loop that ends as soon as D0 becomes less or equal to 2. In case the inputted number itself were 2, then this loop will do a redundant multiplication by 1. I consider that an acceptable loss. You would hardly notice it.

  ORG $8000
       
num    DC.W 4
result DC.L 0

START: 
  clr.l   d1
  move.w  num, d0
  blt.s   Done     ; D1 = 0 for negative numbers

  moveq.l #1, d1
  cmpi.w  #2, d0
  blt.s   Done     ; D1 = 1 for 0! and 1!

  move.w  d0, d1   ; The input number is 2 or more
LOOP:
  subq.w #1, d0    ; 4 - 1 =  3,  3 -  1 =  2
  mulu.w d0, d1    ; 3 * 4 = 12,  2 * 12 = 24
  cmpi.w #2, d0
  bgt.s  LOOP
Done:
  move.l d1, result 
  SIMHALT  
END START

The loop now will do:

3 iterations for input 5  
2 iterations for input 4  
1 iteration  for input 3  
1 iteration  for input 2  
0 iterations for input 1  
0 iterations for input 0

result DC.L 0

About result being a longword. Although the mulu.w instruction produces a full longword, all next iterations in the loop will only be using the low word from that product. However, the final multiplication can leave the result above 65535, and therefore having DC.L is a good choice.