Here's a simple program:
void __attribute__ ((constructor)) dumb_constructor(){}
void __attribute__ ((destructor)) dumb_destructor(){}
int main() {}
I compile it with the following flags:
g++ -O0 -fverbose-asm -no-pie -g -o main main.cpp
I check with gdb that __libc_csu_init is calling the function I tagged with constructor:
Breakpoint 1, dumb_constructor () at main.cpp:1
1 void __attribute__ ((constructor)) dumb_constructor(){}
(gdb) bt
#0 dumb_constructor () at main.cpp:1
#1 0x000000000040116d in __libc_csu_init ()
#2 0x00007ffff7abcfb0 in __libc_start_main () from /usr/lib/libc.so.6
#3 0x000000000040104e in _start ()
and I assume that the destructor attribute would mean dumb_destructor() would be called during __libc_csu_fini, but that's not happening:
Breakpoint 1, dumb_destructor () at main.cpp:3
3 void __attribute__ ((destructor)) dumb_destructor(){}
(gdb) bt
#0 dumb_destructor () at main.cpp:3
#1 0x00007ffff7fe242b in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7ad4537 in __run_exit_handlers () from /usr/lib/libc.so.6
#3 0x00007ffff7ad46ee in exit () from /usr/lib/libc.so.6
#4 0x00007ffff7abd02a in __libc_start_main () from /usr/lib/libc.so.6
#5 0x000000000040104e in _start ()
I sanity checked with objdump that __libc_csu_fini really isn't doing anything, and indeed it is just a stub:
0000000000401190 <__libc_csu_fini>:
401190: f3 0f 1e fa endbr64
401194: c3 ret
Why do we call this _dl_fini? What is _dl_fini? Why is it being inconsistent and not calling __libc_csu_fini?
I refer to the most recent glibc version tag as of writing this, which is glibc 2.34 (published in August 2021), and which changed quite a bit of the startup process (I highlight the major differences). Most findings should also apply to other versions and architectures. The ELF dumps in this answer are from an x86-64 system.
Before we can look into the destructors, we have to understand what is going on at startup.
I skip some kernel-mode parts here for brevity. We start at a point where our program's ELF file is already mapped into memory according to its segment ("program header") table:
$ readelf -l a.out
Elf file type is DYN (Shared object file)
Entry point 0x10a0
There are 13 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
0x00000000000002d8 0x00000000000002d8 R 0x8
INTERP 0x0000000000000318 0x0000000000000318 0x0000000000000318
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000628 0x0000000000000628 R 0x1000
LOAD 0x0000000000001000 0x0000000000001000 0x0000000000001000
0x0000000000000215 0x0000000000000215 R E 0x1000
LOAD 0x0000000000002000 0x0000000000002000 0x0000000000002000
0x00000000000001a0 0x00000000000001a0 R 0x1000
LOAD 0x0000000000002da8 0x0000000000003da8 0x0000000000003da8
0x0000000000000268 0x0000000000000270 RW 0x1000
...(and a few more)
Our application is dynamically linked (i.e., the ELF file does not contain all functions it calls), so we have to load all dependencies into the process's virtual address space as well. However, the kernel itself has only a limited understanding of the ELF format, and should not make too many assumptions about the user space environment anyway. Thus, ELF specifies a special interpreter program, whose path can be found in the INTERP segment.
On Linux, this usually happens to be the dynamic linker /lib64/ld-linux-x86-64.so.2. The kernel subsequently loads that dynamic linker ELF into the same virtual address space as our application and then calls the dynamic linker's entry point (not the entry point of our application).
The dynamic linker now reads the DYNAMIC segment (dynamic table) of our program, which contains information about needed dependencies, symbol tables, relocations and so on:
$ readelf -d a.out
Dynamic section at offset 0x2dc8 contains 27 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x1000
0x000000000000000d (FINI) 0x1208
0x0000000000000019 (INIT_ARRAY) 0x3da8
0x000000000000001b (INIT_ARRAYSZ) 16 (bytes)
0x000000000000001a (FINI_ARRAY) 0x3db8
0x000000000000001c (FINI_ARRAYSZ) 16 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x3a0
0x0000000000000005 (STRTAB) 0x470
0x0000000000000006 (SYMTAB) 0x3c8
0x000000000000000a (STRSZ) 130 (bytes)
...(and a few more)
With that information it starts visiting all NEEDED dependencies of our program, recursively. Each dependency is mapped into memory and relocated, and then initialized via dl_init, which calls all functions from the INIT/INIT_ARRAY dynamic table entries (i.e., the library's constructors). Once the dynamic linker is done and all dependencies are loaded and initialized, it hands over control to our application's entry point (_start).
_start gets a few arguments, most notably a function pointer to _dl_fini in rdx. _start then prepares the stack, places some arguments in registers and finally calls __libc_start_main.
__libc_start_main receives the following arguments:

- main (which is the main function we wrote)
- argc, argv
- init (pointing to __libc_csu_init before glibc 2.34)
- fini (pointing to __libc_csu_fini before glibc 2.34)
- rtld_fini (which equals the rdx argument of _start and thus points to _dl_fini)

The function does some initialization of libc, sets up thread local storage and stack canaries, and a lot more. Here we only care about two calls:
- __cxa_atexit ((void (*) (void *)) rtld_fini, NULL, NULL);, which registers _dl_fini as a destructor to run after program exit
- a call to init, which points to __libc_csu_init (< glibc 2.34) or to call_init (>= glibc 2.34)
Both __libc_csu_init and call_init do basically the same thing: they run all constructors registered in the dynamic table entries INIT and INIT_ARRAY. However, while __libc_csu_init is statically compiled into our program, call_init lives in libc and thus in a different memory region. This was changed after security researchers found a ROP gadget in __libc_csu_init's assembly code.
We thus observe the following backtrace for each constructor:
my_constructor()
__libc_csu_init() (< glibc 2.34) or call_init() (>= glibc 2.34)
__libc_start_main()
_start()
After __libc_start_main is done with initialization, it transfers control to our main function:
_Noreturn static __always_inline void
__libc_start_call_main (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
int argc, char **argv MAIN_AUXVEC_DECL)
{
exit (main (argc, argv, __environ MAIN_AUXVEC_PARAM));
}
We now have seen what happens when an executable is initialized. But what about the end?
As we can see in the code snippet above, exit runs as soon as main returns. So what does exit do?
Turns out, it only transfers control to __run_exit_handlers:
void
exit (int status)
{
__run_exit_handlers (status, &__exit_funcs, true, true);
}
__run_exit_handlers then calls the various functions which have been registered in the __exit_funcs list via calls like __cxa_atexit. If we now look back at the startup procedure, we see that this list should also contain our _dl_fini function, as it was passed as the rtld_fini argument to _start/__libc_start_main!
_dl_fini is the finalizer of the dynamic linker, which iterates through all dependencies and our executable and runs the destructors from FINI and FINI_ARRAY for each of them.
We thus get the following backtrace for each destructor:
my_destructor()
_dl_fini()
__run_exit_handlers()
exit()
__libc_start_main()
_start()
This answers the "what", but not the "why".
Why not __libc_csu_fini?

(please take the following with a grain of salt - I could not find sources for the original reasoning, but inferred it from the source code, commit messages and some comments)
I believe that actually the contrary was intended: to be more consistent. The dynamic linker took care of running the constructors of all dependencies, so it should also run their destructors. And as our program is not much different from those dependencies, why not run its destructors as well? Probably that is the reason why __libc_csu_fini was disabled around 17 years ago. I am not sure why it wasn't removed completely - probably to keep compatibility with existing compilers.
With the recent release of glibc 2.34, both the __libc_csu_init and __libc_csu_fini functions were removed entirely, as their tasks are now done by other parts of the runtime.
Why not run the constructors in dl_init?

Well, dl_init runs before our app's entry point _start - at a time when several important parts of the runtime are not yet available (their initialization is done in __libc_start_main). So our constructors would need to be self-contained and avoid calling external functions. As this would pose quite a risk for reliability and security, the constructors are instead executed after all other initialization is done.
Actually, there is support for initialization functions which are executed by dl_init - these may be specified via the PREINIT_ARRAY dynamic table entry, and run before our _start function. However, there does not appear to be a straightforward way to register these with the compiler, and it is not recommended for the above reasons anyway.
Note: Answering this question took a lot of digging into the inner workings of glibc, which turned out to be even more complex than I initially expected. In order to make this a coherent answer, I had to simplify a few things and skip others. If you find anything inaccurate, please feel free to edit or raise this in the comments.