c++templates operator-overloading global-variables dynamic-allocation

Overloading base types with a custom allocator, and its alternatives

So, this is a bit of an open question. But let's say that I have a large application which globally overrides the various new and delete operators so that they use home-brewed jemalloc-style arenas and custom alignments.

All fine and good, but I have been running into segfault issues because other C++-based DLLs and their dependencies also use the overloaded allocators when they shouldn't (namely LLVM), putting the little custom allocator to its knees (lack of memory and more stresses).

Testing workarounds, I have wrapped (and moved) those global operators into a class, and I made all base classes inherit from it. And well, that works for classes, but not for base types. That's the problem.

Given that C++ doesn't allow useful things like having separate allocators per namespace, or limiting the new operator per executable module, what is the best way of emulating this in base data types, where I can't directly subclass an int?

The obvious way is wrapping them in a custom template, but the problem is performance. Do I have to emulate all the array and indexing operations under a second layer just so that I can malloc from a different place without having to change the rest of the functional code? There's a better way?

P.S.: I have also been thinking about using special global new/delete operators with extra parameters, while leaving the standard ones alone. Thus ensuring that I am (well, my executable module is) the only one calling those global functions. It should be a simple search-and-replace.

Solution

Well, quick update. What I did in the end to 'solve' this conundrum is to manually detect if the code that called the overridden global allocators comes from the main executable module and conditionally redirect all the external new / delete calls to their corresponding malloc / free while still using the custom arena allocator for our own internal code.

How? After doing some R&D I found that this could be done by using the _ReturnAddress() built-in on MSVC and __builtin_extract_return_addr(__builtin_return_address(0)) on GCC/Clang; and I can say that it seems to work fine so far in production software.

Now, when some C++ code from our address space wants some memory we can see where it comes from.

But, how do we find out if that address is part of some other module in our process space or our own? We might need to find out both the base and end addresses of the main program, cache them at startup as globals, and check that the return address is within bounds.

All for extremely little overhead. But, our second problem is that retrieving the base address is different in every platform. After some research I found that things were more straightforward than expected:

In Windows/Win32 we can simply do this:

#include <windows.h>
#include <psapi.h>

inline void __initialize_base_address()
{
    MODULEINFO minfo;

    GetModuleInformation(GetCurrentProcess(), GetModuleHandle(NULL), &minfo, sizeof(minfo));

    base_addr = (uintptr_t) minfo.lpBaseOfDll;
    base_end  = (uintptr_t) minfo.lpBaseOfDll + minfo.SizeOfImage;
}

In Linux there are a thousand ways of doing this, including linker globals and some debuggey (verbose and unreliable) ways of walking the process module table. I was looking at the linker map output and noticed that the _init and _fini functions always seem to wrap the rest of the .text section symbols. Sometimes it's hard to get to the simplest solution that works everywhere:
```
#include <link.h>

inline void __initialize_base_address()
{
    void *handle = dlopen(0, RTLD_NOW);

    base_addr = (uintptr_t) dlsym(handle, "_init");
    base_end  = (uintptr_t) dlsym(handle, "_fini");

    dlclose(handle);
}
```
While in macOS things are even less documented and I had to cobble together my own thing using the Darwin kernel open-source code and tracking down some obscure low-level tools as reference. Keep in mind that _NSGetMachExecuteHeader() is just a wrapper for the internal _mh_execute_header linker global. If you need to do anything about parsing the Mach-O format and its structures then getsect.h is the way to go:
```
#include <mach-o/getsect.h>
#include <mach-o/ldsyms.h>
#include <crt_externs.h>

inline void __initialize_base_address()
{
    size_t size;

    void *ptr = getsectiondata(&_mh_execute_header, SEG_TEXT, SECT_TEXT, &size);

    base_addr = (uintptr_t) _NSGetMachExecuteHeader();
    base_end  = (uintptr_t) ptr + size;
}
```

Another thing to keep in mind is that this some-other-cpp-module-is-using-our-internal-allocator-that-globally-overrides-new-causing-weird-bugs issue seems to be a problem in Linux and maybe macOS, I didn't have this issue in Windows, probably because no conflicting DLLs were loaded in the process, being mostly C API-based. I think, or maybe the platform uses different C++ runtimes for each module.

The main issue I had was caused by Mesa3D, which uses LLVM (pure C++ in and out) for many of their GLSL shader compilers and liked to gobble up big chunks of my small custom-tailored memory arena uninvited.

Rewriting a legacy program that is structurally dependent on these allocators was out of the question due to its sheer size and complexity, so this turned out to be the best way of making things work as expected.

It's only a few lines of optional, sneaky, extra per-platform code.