Access of struct member faster if located <128 bytes from start?


From Agner Fog's C++ optimization manual, I read:

The code for accessing a data member is more compact if the offset of the member relative to the beginning of the structure or class is less than 128 because the offset can be expressed as an 8-bit signed number. If the offset relative to the beginning of the structure or class is 128 bytes or more then the offset has to be expressed as a 32-bit number (the instruction set has nothing between 8 bit and 32 bit offsets). Example:

// Example 7.40
class S2 {
public:
int a[100]; // 400 bytes. first byte at 0, last byte at 399
int b; // 4 bytes. first byte at 400, last byte at 403
int ReadB() {return b;}
};

The offset of b is 400 here. Any code that accesses b through a pointer or a member function such as ReadB needs to code the offset as a 32-bit number. If a and b are swapped then both can be accessed with an offset that is coded as an 8-bit signed number, or no offset at all. This makes the code more compact so that the code cache is used more efficiently. It is therefore recommended that big arrays and other big objects come last in a structure or class declaration and the most often used data members come first. If it is not possible to contain all data members within the first 128 bytes then put the most often used members in the first 128 bytes.

I have tried this and I see no difference in the assembly output of this test program, as shown here:

class S2 {
public:
    int a[100]; // 400 bytes. first byte at 0, last byte at 399
    int b; // 4 bytes. first byte at 400, last byte at 403
    int ReadB() { return b; }
};

// Changed order of variables a and b!
class S3 {
public:
    int b; // 4 bytes. first byte at 0, last byte at 3
    int a[100]; // 400 bytes. first byte at 4, last byte at 403
    int ReadB() { return b; }
};

int main()
{
    S3 s3; s3.b = 32;
    S2 s2; s2.b = 16;
}

The output is

push    rbp
mov     rbp, rsp
sub     rsp, 712
mov     DWORD PTR [rbp-416], 32
mov     DWORD PTR [rbp-432], 16
mov     eax, 0
leave
ret

Clearly, mov DWORD PTR is used for both cases.

  1. Can someone explain why this is?
  2. Can someone explain what is meant by "the instruction set has nothing between 8 bit and 32 bit offsets" (I'm new to ASM) and what this statement suggests that I should be seeing in the ASM?

Solution

  • You're meant to be looking at the asm for ReadB, not main. But since the ReadB functions are defined inline, no asm is generated for them unless you call them (and then it would be mixed in with the code of the calling function). Let's move them out-of-line to make it easier.

    class S2 {
    public:
        int a[100];
        int b;
        int ReadB();
    };
    
    int S2::ReadB() { return b; }
    

    And so on.

    Also, just looking at the asm code won't show you the size of the instructions. You want to look at the actual machine code bytes. Checking "Output : Compile to binary" in godbolt will do that; on a real machine you can compile to an object file and dump with objdump --disassemble or a similar disassembly tool that shows machine code.

    See https://godbolt.org/z/bf7KjK for an updated version.

    Each of these functions takes a this pointer in rdi, and needs to move this->b into eax. So it needs to load a dword from memory at the address given by rdi plus the offset of b in the relevant class. Now you can see that:

    • When b is after a, you get 8b 87 90 01 00 00 (6 bytes) for mov eax, DWORD PTR [rdi+0x190]

    • When b is at the very beginning of the class, you get 8b 07 (2 bytes) for mov eax, DWORD PTR [rdi]

    • When b is before a but after a new int member other, you get 8b 47 04 for mov eax, DWORD PTR [rdi+0x4].
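The offsets behind those three encodings can be checked at compile time. This is only a sketch: it assumes a 4-byte int (as on typical x86-64 ABIs), S4 is a made-up name for the variant with the extra member other, and displacement_bytes is a hypothetical helper that applies only when the base register is rdi, as in these functions.

```cpp
#include <cstddef>  // offsetof, std::size_t

class S2 { public: int a[100]; int b; int ReadB() { return b; } };
class S3 { public: int b; int a[100]; int ReadB() { return b; } };
// Hypothetical name for the layout with a new int member before b.
class S4 { public: int other; int b; int a[100]; int ReadB() { return b; } };

// Extra displacement bytes needed to address a member at `offset`
// through a pointer held in rdi: none, 8-bit, or 32-bit.
constexpr int displacement_bytes(std::size_t offset) {
    return offset == 0 ? 0 : (offset <= 127 ? 1 : 4);
}

static_assert(sizeof(int) == 4, "this sketch assumes a 4-byte int");
static_assert(offsetof(S2, b) == 400, "b follows the 400-byte array");
static_assert(offsetof(S3, b) == 0, "b is the first member");
static_assert(offsetof(S4, b) == 4, "b follows one int");
```

With these layouts, displacement_bytes gives 4, 0 and 1 for S2, S3 and S4, which matches the 6-, 2- and 3-byte instructions listed above.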

    There are three different addressing modes being used here, which specify the address to load from in three ways:

    • as a register (needing two bytes for the instruction),

    • as a register plus a signed 8-bit displacement (occupying 1 additional byte),

    • as a register plus a signed 32-bit displacement (occupying 4 additional bytes).

    If the necessary displacement is nonzero but fits in 8 bits, you can use the second form. If it doesn't, then you are stuck with the third form, making your code 3 bytes larger. (As prl points out, this doesn't necessarily make it slower, but it tends to, since it will use up more precious cache.)
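To make the choice between the three forms concrete, here is a minimal sketch of how an assembler might pick the encoding for mov eax, DWORD PTR [rdi+disp]. The names Encoding and encode_mov_eax_rdi are made up for illustration, and the logic is deliberately limited to rdi as the base register (other bases such as rsp or rbp have extra encoding rules that are not handled here).

```cpp
#include <cstdint>

// Machine-code bytes for one instruction, plus how many of them are used.
struct Encoding {
    std::uint8_t bytes[6];
    int size;
};

// Hypothetical helper: encode `mov eax, DWORD PTR [rdi + disp]`.
// Assumption: base register is rdi only.
constexpr Encoding encode_mov_eax_rdi(std::int32_t disp) {
    Encoding e{{0x8b, 0, 0, 0, 0, 0}, 2};   // 0x8b = MOV r32, r/m32
    const std::uint8_t reg_rm = 0x07;       // reg = 000 (eax), r/m = 111 (rdi)
    if (disp == 0) {
        e.bytes[1] = 0x00 | reg_rm;         // mod = 00: no displacement
    } else if (disp >= -128 && disp <= 127) {
        e.bytes[1] = 0x40 | reg_rm;         // mod = 01: signed 8-bit displacement
        e.bytes[2] = static_cast<std::uint8_t>(disp);
        e.size = 3;
    } else {
        e.bytes[1] = 0x80 | reg_rm;         // mod = 10: 32-bit displacement
        const std::uint32_t u = static_cast<std::uint32_t>(disp);
        for (int i = 0; i < 4; ++i) {       // little-endian byte order
            e.bytes[2 + i] = static_cast<std::uint8_t>(u >> (8 * i));
        }
        e.size = 6;
    }
    return e;
}
```

For displacements 0, 4 and 0x190 this yields 8b 07, 8b 47 04 and 8b 87 90 01 00 00, the same bytes shown in the disassembly above. The 8-bit form is signed, so it covers -128 through 127, which is where the "less than 128" rule comes from.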

    "Nothing between" refers to the idea that you might wish there were a form with, say, a 16-bit displacement, which would be big enough for the displacement 400 but use only two additional bytes. But there isn't one.