c++ assembly name-mangling symbol-table object-code

Why don't we write assemblers and linkers that can handle C++ identifiers?


My understanding of why we use name mangling is that assemblers and linkers can only handle C identifiers. "int foo::bar::baz<spam::eggs>(const MoreSpam&)" can't be used as a label by any existing assembler, and existing linkers won't recognize it as a valid symbol name, so it becomes something like "_ZN3foo3bar3bazIN4spam4eggsEEEiRK8MoreSpam", which is (more or less) a valid C identifier.

But this seems like a relatively trivial limitation of our tools. Is there any good reason why we can't or don't write an assembler and linker in which something like this:

int foo::bar::baz<spam::eggs>(MoreSpam const&):
    ; opcodes go here
    ret

is fine and allowed?


Solution

  • You can actually use int foo::bar::baz<spam::eggs>(MoreSpam const&) as an identifier with the GNU assembler. You just need to put the name in quotes:

    "int foo::bar::baz<spam::eggs>(MoreSpam const&)":
            ret
    
    $ as -o test.o test.s
    $ nm test.o
    0000000000000000 t int foo::bar::baz<spam::eggs>(MoreSpam const&)
    $ ld test.o
    ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
    $ nm a.out
    0000000000402000 T __bss_start
    0000000000402000 T _edata
    0000000000402000 T _end
    0000000000401000 t int foo::bar::baz<spam::eggs>(MoreSpam const&)
                     U _start
    

    One problem with this, aside from symbol names containing spaces and punctuation being a pain to deal with in many contexts, is that not every mangled C++ identifier can be unambiguously represented as a C++ source fragment. The same C++ "symbol" can have multiple mangled representations, and some mangled symbols have no C++ representation at all.

    For example, the Itanium C++ ABI used by the GNU C++ compiler defines five different ways of mangling the name of the same constructor, depending on which variant of the constructor the compiler generates. Similarly, there are three different ways to mangle the name of a given destructor. The symbols _ZN3fooC1Ev and _ZN3fooC2Ev both demangle as foo::foo(), and both can exist in the same program.

    Sure, you can invent new C++-like syntax to represent these things, but then you're just inventing a more verbose way of mangling symbols.

    Finally, perhaps the most important reason why C++ compilers mangle names the way they do is so that they can work with all sorts of tools. While it's much less common today, the GNU C++ compiler can be used with assemblers other than GAS.