Print sum of squared digits of an integer (TASM)

I'm trying to solve one simple task in assembly (TASM), namely:

Given a natural number that fits in a word (16 bits), compute the sum of the squares of its digits.

For x = 61313, I want to print to DOS the result of 6^2 + 1^2 + 3^2 + 1^2 + 3^2 (that is, 56).

The code below, given by our instructor, can only print a number in DOS and nothing more.

;stack segment
stk segment stack 
    db  128 dup(?)
stk ends

;data segment 
data    segment para public 'data'
x   dw  61313
DThousands  dw  ?
Thousands   dw  ?
Hundreds    dw  ?
Decades     dw  ?
Units       dw  ?
result      dw  ?
data    ends    

;code segment
code    segment para public 'code'      
    assume  cs:code, ds:data, ss:stk
begin:  
    mov ax, data
    mov ds, ax
    mov ax, x ; load the number x into ax
    mov result, ax ; store the value from ax in the reserved memory location result
    mov     ax, result ; reload the value from result into ax
    xor cx, cx  ;MOV CX, 0
    mov bx, 10 ; bx = 10
m_do_while:
    xor dx, dx ; zero dx
    div bx ; divide dx:ax by bx (quotient in ax, remainder in dx)
    push    dx ; push dx (the digit) onto the stack
    inc cx ; increment cx by 1
    cmp ax, 0 ; compare ax with zero
    jne m_do_while ; repeat while ax is not zero
    mov ah, 2 ; put 2 in ah (DOS function: print the character in dl)
m_for:
    pop dx ; pop a digit from the stack into dx
    add dx, 30h ; add 30h to dx (convert the digit to ASCII)
    int 21h ; DOS system interrupt
    loop    m_for ; repeat cx times
back:
;end of program
    mov ax, 4C00h
    int 21h
code    ends
    end begin

Solution

You're already getting the digits one at a time in that printing loop. Addition is commutative and associative, so the order you get them in doesn't matter: you can accumulate the sum starting with the least-significant digit.

digit_sum:
    mov   ax, x       ; input in AX
    mov   bx, 10      ; base 10
    xor   cx, cx      ; sum
.sumloop:
    xor   dx, dx
    div   bx          ; quotient in AX,  remainder (the digit) in DX

  ;; With 386
    ;imul  dx, dx      ; requires 386
    ;add   cx, dx      ; sum += digit^2

  ;; Without 386
    xchg   ax, dx
    mul    al          ; result in AX.  DX untouched.  single-digit numbers fit in AL
    add    cx, ax      ; sum += digit^2
    mov    ax, dx

    test  ax, ax
    jne  .sumloop

;;; sum in CX
    ret

Then print CX efficiently, e.g. by converting it into a buffer starting from the end and then making one print system call (see "How do I print an integer in Assembly Level Programming without printf from the c library?"). I wouldn't recommend the clunky two-loop push/pop method you show in the question, although it's popular and does work. Either way, mov ax, cx would put the sum in AX, ready for the divide-by-10 loop.
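As a rough sketch of the buffer approach, assuming TASM/MASM syntax, DS already pointing at the data segment, and the DOS "print string" call (int 21h with AH=09h, which expects a '$'-terminated string at DS:DX). The names numbuf, numend, and print_ax are made up for this example:

```asm
; --- in the data segment (hypothetical names) ---
numbuf  db  5 dup(?)        ; room for up to 5 digits of a 16-bit value
numend  db  '$'             ; terminator required by int 21h / AH=09h

; --- in the code segment ---
; in:  AX = unsigned value to print
; clobbers AX, BX, DX, DI
print_ax:
    mov  bx, 10
    mov  di, offset numend  ; DI points at the '$' just past the buffer
pa_digit:
    xor  dx, dx
    div  bx                 ; AX = quotient, DX = remainder (0..9)
    add  dl, '0'            ; digit to ASCII
    dec  di
    mov  [di], dl           ; store digits from the end backwards
    test ax, ax
    jnz  pa_digit
    mov  dx, di             ; DS:DX -> most significant digit
    mov  ah, 9
    int  21h                ; one system call prints the whole number
    ret
```

With the sum in CX, mov ax, cx followed by call print_ax would print it.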

You could even get some code reuse by using a divide-by-10-and-push loop like the one you have, or like the one in "Assembly 8086 | Sum of an array, printing multi-digit numbers". The first time, use it to get the digits of the input, which you pop, square, and add to a sum. The second time, use it to generate the digits of the sum, which you pop and print. (But writing a function that leaves a variable amount of stuff on the stack is tricky; you could pop the return address at the start of the function, then push it back and ret at the end. Or just make it a macro that you use twice, so it inlines in both places.)
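A minimal sketch of the "pop the return address" variant, assuming a near call within one code segment and TASM/MASM syntax. The names push_digits, pd_next, and sq_next are hypothetical:

```asm
; in:  AX = unsigned value
; out: CX = number of digits; that many words pushed on the stack,
;      most significant digit on top
; clobbers AX, BX, DX, SI
push_digits:
    pop  si              ; take our own return address off the stack
    mov  bx, 10
    xor  cx, cx
pd_next:
    xor  dx, dx
    div  bx              ; AX = quotient, DX = digit
    push dx
    inc  cx
    test ax, ax
    jnz  pd_next
    push si              ; put the return address back on top
    ret                  ; near return; the digits stay on the caller's stack

; first use: sum of squared digits of x, result in BX
    mov  ax, x
    call push_digits
    xor  bx, bx          ; sum = 0
sq_next:
    pop  ax              ; next digit (0..9, so it fits in AL)
    mul  al              ; AX = digit^2
    add  bx, ax
    loop sq_next         ; CX still holds the digit count
```

The second use would load the sum into AX, call push_digits again, and pop and print CX digits the way the question's m_for loop does.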

If you wanted to be 8086-compatible but tune for more recent Intel CPUs (where xchg is 3 uops and thus costs about the same as 3 mov instructions): Instead of xchg ax,dx you could use mov si, ax / mov ax, dx, then mul/add, then mov ax, si. For actual ancient 8086, xchg is great: smaller is faster (except for really slow instructions like mul and div) and xchg-with-ax is only 1 byte.
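For illustration, the same loop with the xchg replaced by mov through SI could look roughly like this. It assumes SI is free to clobber; the labels digit_sum_mov and dsm_loop are made up for this sketch:

```asm
; out: CX = sum of squared decimal digits of x
; clobbers AX, BX, DX, SI
digit_sum_mov:
    mov  ax, x
    mov  bx, 10
    xor  cx, cx          ; sum = 0
dsm_loop:
    xor  dx, dx
    div  bx              ; AX = quotient, DX = digit
    mov  si, ax          ; save the quotient
    mov  ax, dx          ; digit into AX (it fits in AL)
    mul  al              ; AX = digit^2; DX and SI are untouched
    add  cx, ax          ; sum += digit^2
    mov  ax, si          ; restore the quotient
    test ax, ax
    jnz  dsm_loop
    ret
```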

Of course if you actually care about speed you'd use a multiplicative inverse to divide by 10. And for actual 8086 where mul is quite slow (but not as slow as div), you might use a lookup-table of squares to save a mul in that part:

    ; given a digit in DX, add its square to CX, indexing a table of words
    mov si, dx
    shl si,1
    add cx, [table + si]
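
The snippet above assumes a table of squares already exists in the data segment. One way to define it, keeping the name table from the snippet (the layout is an assumption for this example):

```asm
; in the data segment: table[d] = d*d for d = 0..9, one word per entry
table   dw  0, 1, 4, 9, 16, 25, 36, 49, 64, 81
```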

Or with just a table of bytes, trading 1 byte of extra code-size for a smaller table and 1 fewer byte of data loaded (break even on 8088, except for prefetch differences):

    mov si, dx
    add cl, [table + si]
    adc ch, 0              ; carry to the high half of CX
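
Going back to the multiplicative-inverse idea mentioned earlier, here is a hedged sketch of dividing a 16-bit unsigned value by 10 without div. It uses the constant 0CCCDh (2^19 / 10 rounded up) and takes the 32-bit product shifted right by 19, which gives the exact quotient for every 16-bit input. The routine name div10 is hypothetical, and the shifts are done one bit at a time to stay 8086-compatible without using CL:

```asm
; in:  AX = unsigned 16-bit value n
; out: AX = n / 10, DX = n mod 10
; clobbers BX
div10:
    mov  bx, ax          ; keep n for the remainder
    mov  dx, 0CCCDh      ; ceil(2^19 / 10)
    mul  dx              ; DX:AX = n * 0CCCDh
    shr  dx, 1
    shr  dx, 1
    shr  dx, 1           ; DX = product >> 19 = n / 10
    mov  ax, dx          ; AX = quotient q
    shl  dx, 1           ; 2q
    shl  dx, 1           ; 4q
    add  dx, ax          ; 5q
    shl  dx, 1           ; 10q (at most 65530, so no overflow)
    sub  bx, dx          ; n - 10q = remainder
    mov  dx, bx          ; DX = digit
    ret
```

A call to div10 could replace the xor dx, dx / div bx pair in the loop, since it returns the quotient and digit in the same registers. BX would then no longer need to hold 10.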