objective-c clang mach-o

How does Mach-O store pointers to Objective-C Metadata entities?


A somewhat lengthy question so please bear with me.

I am writing a parser to extract Objective-C metadata entities from input Mach-O binaries, and I want to better understand how pointers to metadata entities are stored/encoded in Mach-O files.

For example, using this Obj-C code:

#import <Foundation/Foundation.h>

@interface Person : NSObject

- (void) someMethod;

@end

@implementation Person

- (void) someMethod {}

@end

int main() {
    return 0;
}

I compile it twice:

1. First using the following command:
clang++ -target arm64-apple-ios16 -isysroot /path/to/iphoneos_sdk \
  -framework Foundation -o test test.m

Output from objdump -s test

...
Contents of section __DATA_CONST.__objc_classlist:
 100008000 c0c00000 00000000                    ........        
...
Contents of section __DATA.__objc_data:
 10000c098 01000000 00001080 01000000 00001080  ................
 10000c0a8 00000000 00002080 00000000 00000000  ...... .........
 10000c0b8 00c00000 00001000 98c00000 00001000  ................
 10000c0c8 02000000 00001080 00000000 00002080  .............. .
 10000c0d8 00000000 00000000 48c00000 00000000  ........H.......

Note that the class pointer is stored as 0xc0c0 in the __objc_classlist section, while the class itself is located at address 0x0001 0000 c0c0 in the __objc_data section.

2. Then I compile the same input again using the following command (note that I have an Intel machine so the target here is x86_64 by default):
clang++ -framework Foundation -o test test.m

Output from objdump -s test

...
Contents of section __DATA_CONST.__objc_classlist:
 100004000 d8800000 01000000                    ........        
...
Contents of section __DATA.__objc_data:
 1000080b0 00000000 00000000 00000000 00000000  ................
 1000080c0 00000000 00000000 00000000 00000000  ................
 1000080d0 00800000 01000000 b0800000 01000000  ................
 1000080e0 00000000 00000000 00000000 00000000  ................
 1000080f0 00000000 00000000 68800000 01000000  ........h.......

In this case, the class pointer is stored as 0x0001 0000 80d8 in the __objc_classlist section and we can use that address to go to where the class is actually stored in the __objc_data section.


I also noticed other ways in which pointers are encoded. For example, I came across a case for ARM64 targets where a pointer to a metadata entity was stored as: 0x0000 9000 0000 3faf while the actual location is 0x0001 0000 3faf.


So, my question is: how does Objective-C/clang encode MD entity pointers in Mach-O files?


Solution

  • You're looking at data as it sits on disk, before the dynamic linker (dyld) has applied any fixups. To parse it meaningfully, you need to account for the bind and rebase operations dyld performs at load time.

    So, my question is: how does Objective-C/clang encode MD entity pointers in Mach-O files?

    It depends. Specifically, it depends on what runtime linking format for binds and rebases your binary uses. Broadly speaking, there are two formats:

    1. Dyld opcodes.
      This is the "old" format and has been used since macOS 10.6. In this format, all fixup metadata is stored separately from the data it applies to, which is why you get clean pointers in your x86_64 binary, surrounded by zeroes. As its name suggests, it's an opcode-based sequence of instructions, which is stored somewhere in __LINKEDIT and is pointed to by the LC_DYLD_INFO/LC_DYLD_INFO_ONLY load command in the Mach-O header. You can dump this info specifically with xcrun dyld_info -opcodes:

      % xcrun dyld_info -opcodes test.macos
      test.macos [x86_64]:
          -opcodes:
              rebase opcodes:
                  0x0000 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x0018 REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB
                  0x0050 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x0058 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x0060 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x0080 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x0088 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x00D0 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x00D8 REBASE_OPCODE_DO_REBASE_IMM_TIMES
                  0x00F8 REBASE_OPCODE_DO_REBASE_IMM_TIMES
              regular bind opcodes:
                  0x00E0 BIND_OPCODE_DO_BIND
                  0x00B0 BIND_OPCODE_DO_BIND
                  0x00B8 BIND_OPCODE_DO_BIND
                  0x00C0 BIND_OPCODE_DO_BIND_ADD_ADDR_IMM_SCALED
                  0x00E8 BIND_OPCODE_DO_BIND
              no lazy bind opcodes
              no weak bind opcodes
      

      The load command and dyld opcodes are defined in mach-o/loader.h. Use of the opcodes has been somewhat detailed by Jonathan Levin, though for the actual implementation, see MachOAnalyzer.cpp and MachOLayout.cpp in dyld source.
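
For a parser, the rebase half of the opcode stream can be interpreted with a small state machine. Below is a minimal sketch in Python (my choice of language; the function names are made up for illustration). The opcode values are the ones defined in mach-o/loader.h; the input is assumed to be the raw bytes that rebase_off/rebase_size in the load command point to, and the rebase type is ignored.

```python
REBASE_OPCODE_MASK = 0xF0
REBASE_IMMEDIATE_MASK = 0x0F
REBASE_OPCODE_DONE = 0x00
REBASE_OPCODE_SET_TYPE_IMM = 0x10
REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB = 0x20
REBASE_OPCODE_ADD_ADDR_ULEB = 0x30
REBASE_OPCODE_ADD_ADDR_IMM_SCALED = 0x40
REBASE_OPCODE_DO_REBASE_IMM_TIMES = 0x50
REBASE_OPCODE_DO_REBASE_ULEB_TIMES = 0x60
REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB = 0x70
REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB = 0x80
MASK64 = 0xFFFFFFFFFFFFFFFF


def read_uleb128(data, pos):
    """Decode one ULEB128 value; return (value, position after it)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def rebase_locations(opcodes, pointer_size=8):
    """Yield (segment_index, segment_offset) for every slot to be rebased."""
    seg, off, pos = 0, 0, 0
    while pos < len(opcodes):
        op = opcodes[pos] & REBASE_OPCODE_MASK
        imm = opcodes[pos] & REBASE_IMMEDIATE_MASK
        pos += 1
        if op == REBASE_OPCODE_DONE:
            return
        elif op == REBASE_OPCODE_SET_TYPE_IMM:
            pass  # rebase type is not needed just to locate the slots
        elif op == REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB:
            seg = imm
            off, pos = read_uleb128(opcodes, pos)
        elif op == REBASE_OPCODE_ADD_ADDR_ULEB:
            delta, pos = read_uleb128(opcodes, pos)
            off = (off + delta) & MASK64
        elif op == REBASE_OPCODE_ADD_ADDR_IMM_SCALED:
            off += imm * pointer_size
        elif op == REBASE_OPCODE_DO_REBASE_IMM_TIMES:
            for _ in range(imm):
                yield seg, off
                off += pointer_size
        elif op == REBASE_OPCODE_DO_REBASE_ULEB_TIMES:
            count, pos = read_uleb128(opcodes, pos)
            for _ in range(count):
                yield seg, off
                off += pointer_size
        elif op == REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB:
            yield seg, off
            delta, pos = read_uleb128(opcodes, pos)
            off = (off + delta + pointer_size) & MASK64
        elif op == REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB:
            count, pos = read_uleb128(opcodes, pos)
            skip, pos = read_uleb128(opcodes, pos)
            for _ in range(count):
                yield seg, off
                off += skip + pointer_size
        else:
            raise ValueError(f"unknown rebase opcode {op:#x}")
```

Each location it yields holds a plain pointer relative to the preferred load address (like 0x1000080d8 in the x86_64 dump), so a static parser can read the value as-is. Bind opcodes follow the same pattern but additionally carry a symbol name, dylib ordinal and addend.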

    2. Chained fixups.
      This is the "new" format first introduced in iOS 12 on arm64e. In this format, some metadata is stored alongside the target data it applies to, which is what you're seeing in your arm64 binary.
      Initially this format was only used for arm64e binaries. Whether it is used today depends on the target architecture and minimum OS version, but iOS 16 and macOS 13 targets now seem to use it for all architectures (I'm guessing your default macOS target is 12.x or lower).
      The way this works is by first segmenting the binary into pages (which may or may not match the hardware page size), and recording the offset of the first value that needs to be operated on in each page. The data at that offset then encodes the information needed to construct a valid pointer at load time, as well as the offset to the next such value, thereby forming the "fixup chain".
      Cramming all of this into a 64-bit (or sometimes even 32-bit) value is of course no small feat, so there are many subtly different formats to pick from, each optimised for a special use case (see mach-o/fixup-chains.h). Generally, though, the top bit tells you whether it's a bind or a rebase, some number of bits in the middle encode the distance to the next pointer, pointer authentication stuff, etc., and the remaining bits encode the offset from the base of the image (for rebases) or the index into the import symbol table (for binds). Also, although the format is recorded per segment, the linker appears to pick a single one for the entire binary, so you will likely only have to implement two or three, and will never encounter the rest.
      At that point you're left with the list of page offsets that lead to the first value on each page. If chained fixups are used in conjunction with dyld opcodes, then this is encoded somehow (I never looked at it) in the dyld opcode sequence with BIND_OPCODE_THREADED. If this is used stand-alone, then there is a LC_DYLD_CHAINED_FIXUPS load command in the Mach-O header, which points to a struct dyld_chained_fixups_header, which points to a few more structs, encoded as offsets from itself. One of those holds the page starts, another holds the list of imported symbols, etc. See mach-o/fixup-chains.h again for those.
      You can use xcrun dyld_info -fixup_chains and xcrun dyld_info -fixup_chain_details to examine this:

      % xcrun dyld_info -fixup_chains test.ios
      test.ios [arm64]:
          -fixup_chains:
      seg[2]:
        page_size:       0x4000
        pointer_format:  6 (generic 64-bit, 4-byte stride, target vmoffset )
        segment_offset:  0x00008000
        max_pointer:     0x00000000
        pages:         1
          start[ 0]:  0x0000
      seg[3]:
        page_size:       0x4000
        pointer_format:  6 (generic 64-bit, 4-byte stride, target vmoffset )
        segment_offset:  0x0000C000
        max_pointer:     0x00000000
        pages:         1
          start[ 0]:  0x0018
      
      % xcrun dyld_info -fixup_chain_details test.ios
      test.ios [arm64]:
          -fixup_chain_details:
        0x00008000:  raw: 0x000000000000C0C0       rebase: (next: 000, target: 0x0000000C0C0, high8: 0x00)
        0x0000C018:  raw: 0x0090000000007F9C       rebase: (next: 018, target: 0x00000007F9C, high8: 0x00)
        0x0000C060:  raw: 0x0010000000007F9C       rebase: (next: 002, target: 0x00000007F9C, high8: 0x00)
        0x0000C068:  raw: 0x0050000000007F88       rebase: (next: 010, target: 0x00000007F88, high8: 0x00)
        0x0000C090:  raw: 0x0010000000007FA3       rebase: (next: 002, target: 0x00000007FA3, high8: 0x00)
        0x0000C098:  raw: 0x8010000000000001         bind: (next: 002, ordinal: 000001, addend: 0)
        0x0000C0A0:  raw: 0x8010000000000001         bind: (next: 002, ordinal: 000001, addend: 0)
        0x0000C0A8:  raw: 0x8020000000000000         bind: (next: 004, ordinal: 000000, addend: 0)
        0x0000C0B8:  raw: 0x001000000000C000       rebase: (next: 002, target: 0x0000000C000, high8: 0x00)
        0x0000C0C0:  raw: 0x001000000000C098       rebase: (next: 002, target: 0x0000000C098, high8: 0x00)
        0x0000C0C8:  raw: 0x8010000000000002         bind: (next: 002, ordinal: 000002, addend: 0)
        0x0000C0D0:  raw: 0x8020000000000000         bind: (next: 004, ordinal: 000000, addend: 0)
        0x0000C0E0:  raw: 0x000000000000C048       rebase: (next: 000, target: 0x0000000C048, high8: 0x00)
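
To make the bit layout concrete, here is a minimal Python sketch that decodes values of pointer_format 6 (DYLD_CHAINED_PTR_64_OFFSET, the format shown in the dump above) and follows a chain. The field widths are taken from struct dyld_chained_ptr_64_rebase and struct dyld_chained_ptr_64_bind in mach-o/fixup-chains.h; the function names and the dictionary representation are made up for illustration, and other pointer formats would need their own decoders.

```python
import struct

# pointer_format 6 packs its fields from the low bits upwards:
#   rebase: target:36  | high8:8  | reserved:7  | next:12 | bind:1
#   bind:   ordinal:24 | addend:8 | reserved:19 | next:12 | bind:1
STRIDE = 4  # "next" counts 4-byte units in this format


def decode_chained_ptr_64(raw):
    """Split one raw 64-bit chained-fixup value into its fields."""
    fixup = {"next": (raw >> 51) & 0xFFF}
    if raw >> 63:
        fixup.update(kind="bind", ordinal=raw & 0xFFFFFF,
                     addend=(raw >> 24) & 0xFF)
    else:
        fixup.update(kind="rebase", target=raw & 0xFFFFFFFFF,
                     high8=(raw >> 36) & 0xFF)
    return fixup


def resolve_rebase(fixup, image_base):
    """Rebuild the pointer a rebase describes (target is an image offset)."""
    return (fixup["high8"] << 56) | (image_base + fixup["target"])


def walk_chain(segment, first_offset):
    """Yield (offset, fixup) for each link, starting at a page's first fixup."""
    offset = first_offset
    while True:
        (raw,) = struct.unpack_from("<Q", segment, offset)
        fixup = decode_chained_ptr_64(raw)
        yield offset, fixup
        if fixup["next"] == 0:  # a zero "next" ends the chain for this page
            return
        offset += fixup["next"] * STRIDE


# The __objc_classlist entry from the question, with the image based at 0x100000000:
entry = decode_chained_ptr_64(0x000000000000C0C0)
print(hex(resolve_rebase(entry, 0x100000000)))  # 0x10000c0c0
```

For a bind, the ordinal indexes the imports table instead, so the slot names a symbol in another image rather than an address in this one. That is why the isa and superclass slots in __objc_data show up as 0x8010000000000001 and friends.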
      

    In the more general case, you could also use xcrun dyld_info -fixups to display any sort of bind or rebase target, no matter whether it uses dyld opcodes or fixup chains under the hood. But I suppose that won't help you much for the purpose of writing a parser.
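
For the stand-alone chained fixups case, locating the start of each chain could look roughly like the Python sketch below. It assumes you have already sliced out the __LINKEDIT bytes that LC_DYLD_CHAINED_FIXUPS points to (its dataoff/datasize), that the file is little-endian, and that a 64-bit pointer format is in use, so the multi-start page encoding of the 32-bit formats is not handled. The struct layouts follow dyld_chained_fixups_header, dyld_chained_starts_in_image and dyld_chained_starts_in_segment in mach-o/fixup-chains.h; the function name and the returned dictionaries are hypothetical.

```python
import struct

DYLD_CHAINED_PTR_START_NONE = 0xFFFF  # page contains no fixups


def parse_chain_starts(blob):
    """Return per-segment info, including the vm offset of each chain start."""
    # dyld_chained_fixups_header: seven uint32 fields
    (_version, starts_offset, _imports_offset, _symbols_offset,
     _imports_count, _imports_format, _symbols_format) = struct.unpack_from("<7I", blob, 0)

    # dyld_chained_starts_in_image: seg_count, then seg_count offsets that are
    # relative to the start of this struct (0 means "no fixups in this segment")
    (seg_count,) = struct.unpack_from("<I", blob, starts_offset)
    seg_info_offsets = struct.unpack_from(f"<{seg_count}I", blob, starts_offset + 4)

    segments = []
    for index, rel in enumerate(seg_info_offsets):
        if rel == 0:
            continue
        base = starts_offset + rel
        # dyld_chained_starts_in_segment: size, page_size, pointer_format,
        # segment_offset, max_valid_pointer, page_count, page_start[]
        (_size, page_size, pointer_format, segment_offset,
         _max_valid_pointer, page_count) = struct.unpack_from("<IHHQIH", blob, base)
        page_starts = struct.unpack_from(f"<{page_count}H", blob, base + 22)
        chain_starts = [
            segment_offset + page * page_size + start
            for page, start in enumerate(page_starts)
            if start != DYLD_CHAINED_PTR_START_NONE
        ]
        segments.append({
            "segment_index": index,
            "pointer_format": pointer_format,
            "page_size": page_size,
            "chain_starts": chain_starts,
        })
    return segments
```

Run against the arm64 binary above, this should report chain starts at 0x8000 and 0xC018, matching the first entries of each segment in the -fixup_chain_details output. From each start you then follow the "next" fields until one is zero.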