loops assembly x86-16 emu8086 microprocessors

How can I write code that loops 256 times using only 3 instructions and one 8-bit register (8086 instruction set)?

This question was posed by a professor of mine, and I'm assuming the 8-bit register is either CL or CH. I got it working by simply moving 01H into the CH register, but I was wondering whether there is another way of doing this, since I am technically using the whole 16-bit CX register when the code runs.

My code for reference:

MOV CH,01H
L1:INC AX    ;to keep count
LOOP L1

Solution

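You're right, your code uses 16-bit CX, because loop decrements and tests the whole register. Worse, it depends on CL being zero before this snippet executes: with CH = 1, CX is 256 + CL, so the loop runs 256 + CL times. An 8-bit loop counter that starts at zero will wrap back to zero 256 decrements (or increments) later.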

   mov  al, 0        ; uint8_t i = 0;   xor ax,ax is the same code size but also zeros AH
loop_top:            ; do {
   dec  al
   jnz  loop_top     ; } while (--i != 0);

Nothing in the question said there needed to be any work inside the loop; this is just an empty delay loop.
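The wrap-around can be checked outside an emulator with a minimal C sketch of the same dec/jnz loop. It models AL as a uint8_t. The function names count_dec_jnz_iterations and check_dec_jnz are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models:  mov al, start / loop_top: dec al / jnz loop_top
 * Returns how many times the dec executes before the loop exits. */
unsigned count_dec_jnz_iterations(uint8_t start)
{
    uint8_t al = start;          /* the 8-bit counter register */
    unsigned iterations = 0;

    do {
        al--;                    /* dec al: 0x00 wraps to 0xFF (modulo 256) */
        iterations++;
    } while (al != 0);           /* jnz: keep looping while the result is non-zero */

    return iterations;
}

/* Hypothetical self-check, not part of the original answer. */
void check_dec_jnz(void)
{
    assert(count_dec_jnz_iterations(0) == 256);    /* starts at 0, wraps, 256 trips */
    assert(count_dec_jnz_iterations(1) == 1);      /* 1 -> 0 exits immediately */
    assert(count_dec_jnz_iterations(255) == 255);  /* no wrap needed */
}
```

Starting at 0 is the only 8-bit initial value that gives 256 trips; any other start value n gives exactly n.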

Efficiency notes: dec ax (1 byte) is smaller than dec al (2 bytes), and loop rel8 (2 bytes) is even more compact than dec ax / jnz (3 bytes). So if you were optimizing for a real 8086 or 8088, you'd want to keep the loop body smaller, because it runs many more times than the code ahead of the loop. Of course, if you actually just wanted a delay, the larger dec al version would delay longer, since code fetch would take more memory accesses. Overall code size with dec/jnz is the same either way: mov ax, 256 (3 bytes) plus dec ax (1 byte), vs. xor ax,ax (2 bytes) or mov al, 0 (2 bytes) plus dec al (2 bytes).
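Because loop always decrements the full 16-bit CX, a loop-based version is only correct when both halves of CX are known. The following is a minimal C sketch of that behavior, assuming the usual semantics of loop (decrement CX, jump while CX is non-zero). The names count_loop_iterations and check_loop are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models:  top: <body> / loop top   with CX = CH:CL on entry.
 * Returns how many times the loop body executes. */
unsigned long count_loop_iterations(uint8_t ch, uint8_t cl)
{
    uint16_t cx = (uint16_t)((ch << 8) | cl);   /* CX is CH (high) and CL (low) */
    unsigned long iterations = 0;

    do {
        iterations++;            /* the loop body runs once per trip */
        cx--;                    /* loop: CX = CX - 1, modulo 65536 */
    } while (cx != 0);           /* loop: jump back while CX is non-zero */

    return iterations;
}

/* Hypothetical self-check, not part of the original answer. */
void check_loop(void)
{
    assert(count_loop_iterations(0x01, 0x00) == 256);    /* CL happened to be 0 */
    assert(count_loop_iterations(0x01, 0x05) == 261);    /* stale CL = 5 adds 5 trips */
    assert(count_loop_iterations(0x00, 0x00) == 65536);  /* CX = 0 wraps all the way around */
}
```

Setting the whole register with mov cx, 256 corresponds to the first case and removes the dependency on whatever CL held beforehand.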

This works the same with any 8-bit register; AL isn't special for any of these instructions, so you'd often want to keep it free for code that can benefit from its special encodings, such as cmp al, imm8 in 2 bytes instead of the usual 3.

(mov al, 0 vs. xor al,al: there is a false dependency either way on many modern CPUs. mov ah,0 might avoid a false dependency on Skylake; mov from another register does, at least, but maybe not mov from an immediate. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent". Anyway, xor-zeroing is generally not useful on byte registers.)