Gaining access to heap metadata of a process from within itself


While I can write reasonable C code, my expertise is mainly with Java and so I apologize if this question makes no sense.

I am writing some code to help me do heap analysis. I'm doing this via instrumentation with LLVM. What I'm looking for is a way to access the heap metadata for a process from within itself. Is such a thing possible? I know that information about the heap is stored in many malloc_state structs (main_arena for example). If I can gain access to main_arena, I can start enumerating the different arenas, heaps, bins, etc. As I understand, these variables are all defined statically and so they can't be accessed.

But is there some way of getting this information? For example, could I use /proc/$pid/mem to leak the information somehow?

Once I have this information, I basically want to get information about all the different freelists. So for every bin of each bin type, I want the number of chunks in the bin and their sizes. For fast, small, and tcache bins I know that I just need the index to figure out the size. I have looked at how these structures are implemented and how to iterate through them. So all I need is to gain access to these internal structures.

I have looked at malloc_info and that is my fallback, but I would also like to get information about tcache and I don't think that is included in malloc_info.

An option I have considered is to build a custom version of glibc that has the malloc_state variables declared non-statically. But from what I can see, it's not very straightforward to build your own custom glibc, as you have to build the entire toolchain. I'm using clang, so I would have to build LLVM from source against my custom glibc (at least this is what I've understood from researching this approach).


Solution

  • I had a similar requirement recently, so I do think that being able to get to main_arena for a given process does have its value, one example being post-mortem memory usage analysis.

    Using dl_iterate_phdr and elf.h, it's relatively straightforward to resolve main_arena based on the local symbol:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <link.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    
    // Deliberately ignored:
    // - Non-x86_64 architectures
    // - Resource cleanup and error handling
    // - Style
    // Note: the libc path below is distribution-specific; adjust it if your
    // libc lives elsewhere (for example under /lib/x86_64-linux-gnu).
    static int cb(struct dl_phdr_info *info, size_t size, void *data)
    {
      if (strcmp(info->dlpi_name, "/lib64/libc.so.6") == 0) {
        // Map the file from disk: .symtab is not part of the loaded image.
        int fd = open(info->dlpi_name, O_RDONLY);
        struct stat stat;
        fstat(fd, &stat);
        char *base = mmap(NULL, stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        Elf64_Ehdr *header = (Elf64_Ehdr *)base;
        Elf64_Shdr *secs = (Elf64_Shdr *)(base + header->e_shoff);
        for (unsigned secinx = 0; secinx < header->e_shnum; secinx++) {
          if (secs[secinx].sh_type == SHT_SYMTAB) {
            Elf64_Sym *symtab = (Elf64_Sym *)(base + secs[secinx].sh_offset);
            // sh_link is the index of the string table holding the symbol names.
            char *symnames = (char *)(base + secs[secs[secinx].sh_link].sh_offset);
            unsigned symcount = secs[secinx].sh_size / secs[secinx].sh_entsize;
            for (unsigned syminx = 0; syminx < symcount; syminx++) {
              if (strcmp(symnames + symtab[syminx].st_name, "main_arena") == 0) {
                // Runtime address = load base of libc + symbol value.
                void *mainarena = ((char *)info->dlpi_addr) + symtab[syminx].st_value;
                printf("main_arena found: %p\n", mainarena);
                raise(SIGTRAP);  // stop here when run under a debugger
                return 1;        // non-zero ends the dl_iterate_phdr walk
              }
            }
          }
        }
      }
      return 0;  // keep iterating over the remaining loaded objects
    }
    
    int main()
    {
      dl_iterate_phdr(cb, NULL);
      return 0;
    }
    

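    dl_iterate_phdr is used to get the base address of the mapped glibc. The loaded image does not include the symbol table needed (.symtab), so the library file has to be mapped again from disk. The final address is the base address plus the symbol value. Note that this depends on the installed libc still carrying .symtab; a stripped libc has no such section and the lookup finds nothing.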

    (gdb) run
    Starting program: a.out 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x7ffff77f0700 (LWP 24834)]
    main_arena found: 0x7ffff7baec60
    
    Thread 1 "a.out" received signal SIGTRAP, Trace/breakpoint trap.
    raise (sig=5) at ../sysdeps/unix/sysv/linux/raise.c:50
    50    return ret;
    (gdb) select 1
    (gdb) print mainarena
    $1 = (void *) 0x7ffff7baec60 <main_arena>
    (gdb) print &main_arena
    $3 = (struct malloc_state *) 0x7ffff7baec60 <main_arena>
    

    The value matches that of main_arena, so the correct address was found.
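
    The hard-coded /lib64/libc.so.6 in the callback above is distribution-specific. A minimal sketch of a more tolerant lookup, assuming the loaded libc's file name starts with "libc.so" (as it does for glibc's libc.so.6), matches on the file name instead of the full path. The names libc_location, match_libc and find_libc are hypothetical helpers for illustration, not part of any library.

    ```c
    #define _GNU_SOURCE
    #include <link.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper type: which file backs libc and where it is loaded. */
    struct libc_location {
      char path[PATH_MAX];
      ElfW(Addr) base;
    };

    /* dl_iterate_phdr callback: stop at the first loaded object whose
       file name starts with "libc.so" (assumption: glibc naming). */
    static int match_libc(struct dl_phdr_info *info, size_t size, void *data)
    {
      (void)size;
      if (info->dlpi_name == NULL)
        return 0;
      const char *slash = strrchr(info->dlpi_name, '/');
      const char *file = slash ? slash + 1 : info->dlpi_name;
      if (strncmp(file, "libc.so", 7) != 0)
        return 0;                     /* not libc, keep iterating */
      struct libc_location *out = data;
      snprintf(out->path, sizeof out->path, "%s", info->dlpi_name);
      out->base = info->dlpi_addr;
      return 1;                       /* non-zero stops the iteration */
    }

    /* Returns 1 and fills *out if a libc-like object is loaded, else 0.
       dl_iterate_phdr returns the value of the last callback invocation. */
    int find_libc(struct libc_location *out)
    {
      return dl_iterate_phdr(match_libc, out);
    }
    ```

    After a successful find_libc(&loc), loc.path can be opened and mapped in place of the fixed path, and loc.base takes the role of info->dlpi_addr when adding the symbol value.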

    There are other ways to get to main_arena without relying on the library itself. Walking the existing heap allows for discovering main_arena, for example, but that strategy is considerably less straightforward.

    Of course, once you have main_arena, you need all internal type definitions to be able to inspect the data.
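
    As a small illustration of what "internal type definitions" means in practice, the sketch below mirrors just one piece of glibc's layout: the size word stored immediately before every pointer returned by malloc, whose three low bits are used as flags. The helper names chunk_raw_size and chunk_size and the CHUNK_* macros are hypothetical. The layout is assumed from glibc's malloc_chunk and is not a public interface, so it should not be expected to hold for other allocators.

    ```c
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Flag bits glibc keeps in the low bits of a chunk's size word
       (assumed from glibc's malloc.c; not a public, stable interface). */
    #define CHUNK_PREV_INUSE     ((size_t)0x1)
    #define CHUNK_IS_MMAPPED     ((size_t)0x2)
    #define CHUNK_NON_MAIN_ARENA ((size_t)0x4)
    #define CHUNK_FLAG_MASK      ((size_t)0x7)

    /* Read the raw size word stored just before a pointer returned by malloc. */
    size_t chunk_raw_size(const void *mem)
    {
      size_t raw;
      memcpy(&raw, (const char *)mem - sizeof raw, sizeof raw);
      return raw;
    }

    /* Chunk size in bytes (header included) with the flag bits masked off. */
    size_t chunk_size(const void *mem)
    {
      return chunk_raw_size(mem) & ~CHUNK_FLAG_MASK;
    }
    ```

    For a pointer p obtained from malloc, chunk_size(p) is the size that decides which bin the chunk is placed in once it is freed. Walking the bins reachable from main_arena needs the full malloc_state and malloc_chunk definitions taken from the glibc version that is actually loaded.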