Problems linking to multiple libraries with same symbols

I have 2 shared object libraries, e.g. libA.so and libB.so. I have an executable that calls functions from both of these libraries, e.g. funcA() from libA and funcB() from libB.

Unfortunately, both libA and libB have some other functions that use the same symbol names, e.g. libA has functions f1() and f2() and libB also has functions f1() and f2(). Note that the implementations of f1() and f2() are completely different in libA and libB; they just share the same names.

The executable code is never calling f1() or f2() directly, and of course libA should only call its f1() & f2() and libB should only call its f1() and f2().

But, it seems something goes wrong, and I get a segfault when running the executable.

I have a solution for this, which is to use a "version.script" when building libA and libB, so that only the external API functions (i.e. those called "func*") are exposed, i.e. the following script...

{
    global: func*;
    local: *; 
};

... and using -Wl,--version-script=version.script when linking with gcc.

This works, but it does cause me complications because I have to use a 3rd party tool-chain that makes it complex (i.e. hacky) to use this version.script method.

Is there a better / other way? (Note, I do have the source for libA and libB, and so changing symbol names is possible, but not liked since this code is auto-generated by 3rd party tool and I don't really want to go in and edit all the names. Of course, real code is not just f1() and f2() but hundreds of similarly named symbols).

Also, I would like to understand the root cause of the problem? I'm assuming that libA (or libB) gets "confused" to which f1() or f2() it should call, but why - since both libA and libB get compiled separately - something to do with shared object libraries I suspect?

In case it matters... source for the libs and executable is C, and using gcc to compile and link, and this is on Linux.

Solution

First, a minimal illustration of your problem. To keep it minimal I'll cut the ambiguous functions down from f1(), f2() to just f1(). Since you say:

I would like to understand the root cause of the problem?

I'll explain that using the minimal illustration. Then I'll show how to avoid it without a version script.

The example

Source and header files for libA.so:

// A.h
#pragma once
extern void funcA(void);

// A.c
#include <stdio.h>
#include "A1.h"

void funcA(void)
{
    puts(__FUNCTION__);
    f1();
}
    
// A1.h
#pragma once
extern void f1(void);

// A1.c
#include <stdio.h>
#include "A1.h"

void f1(void)
{
    printf("%s from A1.c\n",__FUNCTION__);
}

Source and header files for libB.so:

// B.h
#pragma once
extern void funcB(void);

// B.c
#include <stdio.h>
#include "B1.h"

void funcB(void)
{
    puts(__FUNCTION__);
    f1();
}
    
// B1.h
#pragma once
extern void f1(void);

// B1.c
#include <stdio.h>
#include "B1.h"

void f1(void)
{
    printf("%s from B1.c\n",__FUNCTION__);
}

Source for a program:

// prog.c
#include <A.h>
#include <B.h>

int main(void)
{
    funcA();
    funcB();
    return 0;
}

Compile all the shared library sources:

$ gcc -c -fPIC A*.c B*.c

And the program source:

$ gcc -c -I . prog.c

Link the shared libraries:

$ gcc -shared -o libA.so A*.o 
$ gcc -shared -o libB.so B*.o

And the program:

$ gcc -o prog prog.o -L . -lA -lB -Wl,-rpath=$(pwd)

Then we see your problem:

$ ./prog
funcA
f1 from A1.c
funcB
f1 from A1.c

funcA and funcB both call the f1 from libA.so, defined in A1.c.

By re-linking the program with libA.so and libB.so in reverse order:

$ gcc -o prog prog.o -L . -lB -lA -Wl,-rpath=$(pwd)

we can produce the opposite problem:

$ ./prog
funcA
f1 from B1.c
funcB
f1 from B1.c

Which is no less a problem.

The explanation

The ambiguous function symbol f1 is defined in the dynamic symbol tables of both libA.so and libB.so:

$ readelf -W --dyn-syms libA.so

Symbol table '.dynsym' contains 9 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTable
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND printf@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     5: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     6: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)
     7: 0000000000001182    31 FUNC    GLOBAL DEFAULT   14 funcA
     8: 0000000000001159    41 FUNC    GLOBAL DEFAULT   14 f1
     
$ readelf -W --dyn-syms libB.so

Symbol table '.dynsym' contains 9 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTable
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND printf@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     5: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     6: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)
     7: 0000000000001159    41 FUNC    GLOBAL DEFAULT   14 f1
     8: 0000000000001182    31 FUNC    GLOBAL DEFAULT   14 funcB

So in both shared libraries f1 is visible at runtime to the dynamic linker as an eligible definition for references to f1. By default, the dynamic linker binds all such references to the first definition it finds while loading and linking the shared libraries that the process under construction requires, recursively.
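The "first definition wins" rule can be pictured with a toy model. This is only a sketch: struct toy_lib and toy_resolve are names invented for illustration, not part of any real loader API, and the real dynamic linker's lookup scope has more to it (it also includes the executable itself, for instance).

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_EXPORTS 4

/* Toy model only (hypothetical types): a "library" is a name plus the
 * symbol names it exports in its dynamic symbol table. */
struct toy_lib {
    const char *name;
    const char *exports[TOY_MAX_EXPORTS];   /* NULL-terminated if shorter */
};

/* Return the name of the first library, in load order, that exports `sym`,
 * or NULL if no library does. Later definitions are never consulted. */
const char *toy_resolve(const struct toy_lib *libs, size_t nlibs, const char *sym)
{
    for (size_t i = 0; i < nlibs; i++) {
        for (size_t j = 0; j < TOY_MAX_EXPORTS && libs[i].exports[j] != NULL; j++) {
            if (strcmp(libs[i].exports[j], sym) == 0)
                return libs[i].name;
        }
    }
    return NULL;
}
```

Given the array { libA.so: funcA, f1 }, { libB.so: funcB, f1 }, a lookup of f1 lands in libA.so no matter which library made the call; swap the two entries and it lands in libB.so instead.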

To run prog, that construction starts with the loading of the executable prog. Look at the top of its dynamic section:

$ readelf --dynamic prog

Dynamic section at offset 0x2d90 contains 30 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libB.so]
 0x0000000000000001 (NEEDED)             Shared library: [libA.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/home/imk/develop/so/scrap]
 ...[cut]...

This is information that the static linker has written there for the dynamic linker to read and act on. The executable needs to load libB.so and libA.so, in that order. The RUNPATH /home/imk/develop/so/scrap is a directory the runtime linker can search to find the needed shared libraries (in addition to its default search directories). That's the result of the -rpath=$(pwd) that I added to the program linkage (because I'm not going to bother properly installing these throwaway libraries).

So in this case, the first loaded shared library in which the dynamic linker finds a definition for f1 will be libB.so; that's the definition from B1.c, and it will bind all references in the program to that definition.

Let's revert to the original linkage of prog:

$ gcc -o prog prog.o -L . -lA -lB -Wl,-rpath=$(pwd)

Then, as you've guessed, the top of the dynamic section of prog will read:

$ readelf --dynamic prog

Dynamic section at offset 0x2d90 contains 30 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libA.so]
 0x0000000000000001 (NEEDED)             Shared library: [libB.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/home/imk/develop/so/scrap]
 ...[cut]...

with libA.so loaded before libB.so, and libA's definition of f1 will be the one that wins, as we saw.

The fix

Your problem arises from the static linker's default behaviour when creating a shared library: references within the shared library to dynamic symbols that it defines are not preemptively bound to the internal definition¹. There is a linker option that overrides this default so that each library calls its own definition of f1. From man ld (ld is the static linker, invoked on your behalf by gcc to perform linkages):

-Bsymbolic

When creating a shared library, bind references to global symbols to the definition within the shared library, if any. Normally, it is possible for a program linked against a shared library to override the definition within the shared library. This option is only meaningful on ELF platforms which support shared libraries.

So we can relink the shared libraries like so:

$ gcc -shared -o libA.so A*.o -Wl,-Bsymbolic
$ gcc -shared -o libB.so B*.o -Wl,-Bsymbolic

(Incidentally, we use -Wl,<ld-option> to tell gcc to pass option ld-option straight through to ld.)

Relink the program with the new libraries:

$ gcc -o prog prog.o -L . -lA -lB -Wl,-rpath=$(pwd)

and then each library calls its own definition of f1:

$ ./prog
funcA
f1 from A1.c
funcB
f1 from B1.c
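Why the option changes the outcome can be pictured with a loose single-file analogy. Everything here is hypothetical naming for illustration: f1_global_slot stands in for the indirection that the dynamic linker fills in at load time, and the functions return strings instead of printing so the result is easy to inspect. The real mechanism is more involved than a single function pointer.

```c
/* Two unrelated definitions that share a name in the real libraries. */
const char *f1_from_A(void) { return "f1 from A1.c"; }
const char *f1_from_B(void) { return "f1 from B1.c"; }

/* Stand-in for the process-wide binding of "f1", decided at load time
 * by whichever library comes first in the lookup order. */
const char *(*f1_global_slot)(void);

/* Default linkage: libB's call goes through the shared slot, so it gets
 * whatever definition won the lookup, even if that is libA's. */
const char *funcB_default(void)
{
    return f1_global_slot();
}

/* Symbolic-style linkage: the call was bound to libB's own definition
 * when libB was linked, so load order no longer matters. */
const char *funcB_symbolic(void)
{
    return f1_from_B();
}
```

With f1_global_slot set to f1_from_A (the libA-first case), funcB_default() yields libA's string while funcB_symbolic() still yields libB's.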

You could likely also solve the issue with GCC's symbol visibility attribute, i.e. by marking the internal functions with __attribute__((visibility("hidden"))), but that would require some source modifications in the libraries.
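A minimal sketch of what the visibility approach might look like for libA, with A1.c and A.c merged into one listing for brevity. As an assumption for illustration, the functions return a string instead of printing; in the real layout the attribute would more naturally go on the declaration in A1.h.

```c
/* Hidden: if built into libA.so, f1 stays out of the library's dynamic
 * symbol table, so it cannot clash with, or be preempted by, another
 * library's f1. Code inside the same library can still call it. */
__attribute__((visibility("hidden")))
const char *f1(void)
{
    return "f1 from A1.c";
}

/* Default visibility: funcA remains part of the library's external API. */
__attribute__((visibility("default")))
const char *funcA(void)
{
    return f1();   /* always libA's own f1 */
}
```

Because the code in the question is auto-generated, annotating hundreds of functions this way may be no more attractive than renaming them, which is why the linker option above is likely the less intrusive route.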

¹ This behaviour of the GNU linker with shared libraries is by design: it's the same as what would happen if the libraries were static ones, in keeping with the GNU linker's efforts to make symbol resolution look alike for both static archives and shared libraries.