c, assembly, x86, dos, real-mode

some clang-generated assembly not working in real mode (.COM, tiny memory model)


First, this is kind of a follow-up to Custom memory allocator for real-mode DOS .COM (freestanding) — how to debug?. But to have it self-contained, here's the background:

clang (and gcc, too) has an -m16 switch, so that 32-bit instructions of the i386 instruction set get the prefixes needed to execute in "16-bit" real mode. This can be exploited to create DOS .COM 32-bit-real-mode executables using the GNU linker, as described in this blog post (of course still limited to the tiny memory model, meaning everything lives in one 64KB segment). Wanting to play with this, I created a minimal runtime that seems to work quite nicely.

Then I tried to build my recently created curses-based game with this runtime, and, well, it crashed. The first thing I encountered was a classical heisenbug: printing the offending wrong value made it correct. I found a workaround, only to face the next crash. So my first suspect was my custom malloc() implementation, see the other question. But since nobody has spotted anything really wrong with it so far, I decided to give my heisenbug a second look. It manifests in the following code snippet (note this worked flawlessly when compiled for other platforms):

typedef struct
{
    Item it;    /* this is an enum value ... */
    Food *f;    /* and this is an opaque pointer */
} Slot;

typedef struct board
{
    Screen *screen;
    int w, h;
    Slot slots[1];    /* 1 element for C89 compatibility */
} Board;

[... *snip* ...]

    size = sizeof(Board) + (size_t)(w*h-1) * sizeof(Slot);
    self = malloc(size);
    memset(self, 0, size);

sizeof(Slot) is 8 (with clang on the i386 architecture), sizeof(Board) is 20, and w and h are the dimensions of the game board; when running in DOS they are 80 and 24 (because one line is reserved for the title/status bar). To debug what's going on here, I made my malloc() print its parameter, and it was called with the value 12 (sizeof(Board) + (-1) * sizeof(Slot)?).
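For reference, this is what the call should get, plugging in those numbers (a standalone sanity check, not the actual game code):

#include <stdio.h>

int main(void)
{
    /* sizeof(Board) + (w*h - 1) * sizeof(Slot), with the values above */
    unsigned short size = 20 + (80 * 24 - 1) * 8;
    printf("%u\n", (unsigned)size);   /* 15372 -- fits comfortably in 16 bits */
    return 0;
}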

Printing out w and h showed the correct values, yet malloc() still got 12. Printing out size showed the correctly calculated size, and this time malloc() got the correct value, too. So, a classical heisenbug.

The workaround I found looks like this:

    size = sizeof(Board);
    for (int i = 0; i < w*h-1; ++i) size += sizeof(Slot);

Weirdly enough, this worked. Next logical step: compare the generated assembly. Here I have to admit I'm totally new to x86; my only assembly experience was with the good old 6502. So, in the following snippets, I'll add my assumptions and thoughts as comments, please correct me here.

First the "broken" original version (w, h are in %esi, %edi):

    movl    %esi, %eax
    imull   %edi, %eax           # ok, calculate the product w*h
    leal    12(,%eax,8), %eax    # multiply by 8 (sizeof(Slot)) and add
                                 # 12 as an offset. Looks good because
                                 # 12 = sizeof(Board) - sizeof(Slot)...
    movzwl  %ax, %ebp            # just use 16bit because my size_t for
                                 # realmode is "unsigned short"
    movl    %ebp, (%esp)
    calll   malloc

Now, to me, this looks good, but my malloc() sees 12, as mentioned. The workaround with the loop compiles to the following assembly:

    movl    %edi, %ecx
    imull   %esi, %ecx             # ok, w*h again.
    leal    -1(%ecx), %edx         # edx = ecx-1? loop-end condition?
    movw    $20, %ax               # sizeof(Board)
    testl   %edx, %edx             # I guess that sets just some flags in
                                   # order to check whether (w*h-1) is <= 0?
    jle .LBB0_5
    leal    65548(,%ecx,8), %eax   # This seems to be the loop body
                                   # condensed to a single instruction.
                                   # 65548 = 65536 (0x10000) + 12. So
                                   # there is our offset of 12 again (for 
                                   # 16bit). The rest is the same ...
.LBB0_5:
    movzwl  %ax, %ebp              # use bottom 16 bits
    movl    %ebp, (%esp)
    calll   malloc

As described before, this second variant works as expected. My question after all this long text is as simple as ... WHY? Is there something special about real mode I'm missing here?

For reference: this commit contains both code versions. Just type make -f libdos.mk for a version with the workaround (crashing later). To compile the code leading to the bug, remove the -DDOSREAL from the CFLAGS in libdos.mk first.

Update: given the comments, I tried to debug this myself a bit deeper. Using DOSBox's debugger is somewhat cumbersome, but I finally got it to break at the position of this bug. So, the following assembly code, as intended by clang:

    movl    %esi, %eax
    imull   %edi, %eax
    leal    12(,%eax,8), %eax
    movzwl  %ax, %ebp
    movl    %ebp, (%esp)
    calll   malloc

ends up as this (note the Intel syntax used by DOSBox's disassembler):

0193:2839  6689F0              mov  eax,esi
0193:283C  660FAFC7            imul eax,edi
0193:2840  668D060C00          lea  eax,[000C]             ds:[000C]=0000F000
0193:2845  660FB7E8            movzx ebp,ax                                    
0193:2849  6766892C24          mov  [esp],ebp              ss:[FFB2]=00007B5C
0193:284E  66E8401D0000        call 4594 ($+1d40)

I think this lea instruction looks suspicious, and indeed, after it, the wrong value is in ax. So, I tried to feed the same assembly source to the GNU assembler, using .code16, with the following result (disassembly by objdump; I think it is not entirely correct because it might misinterpret the size prefix bytes):

00000000 <.text>:
   0:   66 89 f0                mov    %si,%ax
   3:   66 0f af c7             imul   %di,%ax
   7:   67 66 8d 04             lea    (%si),%ax
   b:   c5 0c 00                lds    (%eax,%eax,1),%ecx
   e:   00 00                   add    %al,(%eax)
  10:   66 0f b7 e8             movzww %ax,%bp
  14:   67 66 89 2c             mov    %bp,(%si)

The only difference is this lea instruction. Here it starts with 67, meaning "address is 32-bit" in 16-bit real mode. My guess is this is actually needed, because lea is meant to operate on addresses and is just "abused" by the optimizer to do data calculation here. Are my assumptions correct? If so, could this be a bug in clang's internal assembler for -m16? Maybe someone can explain where this 668D060C00 emitted by clang comes from and what it means? 66 means "data is 32-bit" and 8D is probably the opcode itself, but what about the rest?


Solution

  • Your objdump output is bogus. It looks like it's disassembling with the assumption of 32bit address and operand sizes, rather than 16. So it thinks lea ends sooner than it does, and disassembles some of the address bytes into lds / add. And then miraculously gets back into sync, and sees a movzww that zero extends from 16b to 16b... Pretty funny.

    I'm inclined to trust your DOSBOX disassembly output. It perfectly explains your observed behaviour (malloc always called with an arg of 12). You are correct that the culprit is

    lea   eax, [000C]   ; eax = 0x0C = 12   (Intel/MASM/NASM syntax)
    leal  12, %eax      # the same thing in AT&T syntax
    

    It looks like a bug in whatever assembled your DOSBOX binary (clang -m16 I think you said), since it assembled leal 12(,%eax,8), %eax into that.

    leal  12(,%eax,8), %eax  # AT&T
    lea   eax, [12 + eax*8]  ; Intel/MASM/NASM syntax
    

    I could probably dig through some instruction encoding tables / docs and figure out exactly how that lea should have been assembled into machine code. It should be the same as the 32bit-mode encoding, but with 67 66 prefixes (address size and operand size, respectively). (And no, the order of those prefixes doesn't matter, 66 67 would work, too.)

    Your DOSBOX and objdump outputs don't even have the same binary, so yes, they did come out differently. (objdump is misinterpreting the operand-size prefix in previous instructions, but that didn't affect the insn length until LEA.)

    Your GNU as .code16 binary has 67 66 8D 04 C5, then the 32bit 0x0000000C displacement (little-endian). This is LEA with both prefixes. I assume that's the correct encoding of leal 12(,%eax,8), %eax for 16bit mode.

    Your DOSBOX disassembly has just 66 8D 06, with a 16bit 0x0C absolute address. (Missing the 32bit address size prefix, and using a different addressing mode.) I'm not an x86 binary expert; I haven't had problems with disassemblers / instruction encoding before. (And I usually only look at 64bit asm.) So I'd have to look up the encodings for the different addressing modes.
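    Decoding both byte sequences by hand against the ModRM/SIB tables in that manual (so double-check me), the GNU as version breaks down like this:

    67              address-size override: use 32-bit addressing in 16-bit mode
    66              operand-size override: 32-bit destination (%eax)
    8D              LEA opcode
    04              ModRM: mod=00, reg=000 (eax), r/m=100 -> a SIB byte follows
    C5              SIB: scale=3 (*8), index=000 (eax), base=101 -> no base register, disp32 follows
    0C 00 00 00     disp32 = 0x0000000C = 12

    which is exactly leal 12(,%eax,8), %eax. The DOSBOX version is:

    66              operand-size override: 32-bit destination (eax)
    8D              LEA opcode
    06              ModRM: mod=00, reg=000 (eax), r/m=110 -> with 16-bit addressing, an absolute [disp16]
    0C 00           disp16 = 0x000C = 12

    Without the 67 prefix there is no SIB byte at all: the scaled-index part of the address is gone, and the "effective address" lea computes is just the constant 12, which is exactly what your malloc() keeps receiving.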

    My go-to source for x86 instructions is Intel's Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z. (linked from https://stackoverflow.com/tags/x86/info, BTW.)

    It says: (section 2.1.1)

    The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can be the default; use of the prefix selects the non-default size.

    So that's easy, everything is pretty much the same as normal 32bit protected mode, except 16bit operand-size is the default.
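    You can see that prefix at work in the very first instruction of your DOSBOX dump (again my own reading of the bytes):

    66 89 F0        mov eax, esi    ; 66 selects the 32-bit operand size,
                                    ; 89 /r is MOV r/m,r and ModRM F0 picks eax <- esi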

    The LEA insn description has a table describing exactly what happens with various combinations of 16, 32, and 64bit address (67H prefix) and operand sizes (66H prefix). In all cases, it truncates or zero-extends the result when there's a size mismatch, but it's an Intel insn ref manual so it has to lay out every case separately. (This is helpful for more complex instruction behaviour.)
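    For example (my own illustration of that table, in AT&T syntax):

    leal  12(,%eax,8), %eax   # 67 66 prefixes in 16-bit mode: 32-bit address
                              # calculation, full 32-bit result in %eax
    leaw  12(,%eax,8), %ax    # 67 only: same 32-bit calculation, but the
                              # result is truncated to 16 bits in %ax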

    And yes, "abusing" lea by using it on non-address data is a common and useful optimization. You can do a non-destructive add of 2 registers, placing the result in a 3rd, and at the same time add a constant and scale one of the inputs by 2, 4, or 8. So it can do things that would take up to 4 other instructions (mov / shl / add r,r / add r,i). Also, it doesn't affect flags, which is a bonus if you want to preserve flags for another jump or especially cmov.
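    For example (a generic illustration, not from your code):

    leal  7(%edi,%esi,4), %eax   # eax = edi + esi*4 + 7 in one instruction,
                                 # without modifying esi/edi or any flags
    # versus something like:
    movl  %esi, %eax
    shll  $2, %eax               # eax = esi*4   (this one does clobber flags)
    addl  %edi, %eax             # eax += edi
    addl  $7, %eax               # eax += 7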