I was wondering what exactly is stored in a .o or a .so file that results from compiling a C++ program. This post gives quite a good overview of the compilation process and the role a .o file plays in it. As far as I understand from this post, .a and .so files are just multiple .o files merged into a single file that is linked in a static (.a) or dynamic (.so) way.
But I wanted to check if I understand correctly what is stored in such a file. After compiling the following code
void f();
void f2(int);
const int X = 25;

void g() {
    f();
    f2(X);
}

void h() {
    g();
}
I would expect to find the following items in the .o file:
- g(), containing some placeholder addresses where f() and f2(int) are called.
- h(), with no placeholders.
- X, which would be just the number 25.
- A table that specifies at which addresses in the file g(), h() and X can be found.
- A second table listing f() and f2(int), which have to be resolved during linking.

Then a program like nm would list all the symbol names from both tables.
I suppose that the compiler could optimize the call f2(X) by calling f2(25) instead, but it would still need to keep the symbol X in the .o file since there is no way to know if it will be used from a different .o file.
Would that be about correct? Is it the same for .a and .so files?
Thanks for your help!
You're pretty much correct in the general idea for object files. In the "table that specifies at which addresses in the file" I would replace "addresses" with "offsets", but that's just wording.
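To make the two tables concrete, here is a toy C++ model of them for the code in the question. It is only a sketch, not the real ELF layout: the struct names, the offsets and the helper functions are made up for illustration, and the mangled names assume the Itanium C++ ABI used by GCC and Clang on Linux. One C++-specific wrinkle: a namespace-scope const int has internal linkage unless it is declared extern, so X can't be referenced from another object file. The model therefore marks it as a local symbol, and a compiler may drop it entirely. A real toolchain may also record a relocation for the call to g() inside h(), which the model leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only; real ELF objects keep this in sections such as
// .symtab and .rela.text.
struct Symbol {
    std::string name;    // mangled name (Itanium ABI assumed)
    char kind;           // nm-style letter: 'T' text, 'r' local read-only, 'U' undefined
    std::size_t offset;  // offset inside its section (made-up values)
};

struct Relocation {
    std::size_t offset;  // where in the text section the placeholder sits
    std::string target;  // symbol whose address the linker must fill in
};

struct ObjectFile {
    std::vector<Symbol> symbols;
    std::vector<Relocation> relocations;
};

// Hypothetical contents for the translation unit in the question.
ObjectFile model_of_example() {
    ObjectFile obj;
    obj.symbols = {
        {"_Z1gv", 'T', 0x00},   // g(), defined here
        {"_Z1hv", 'T', 0x20},   // h(), defined here
        {"_ZL1X", 'r', 0x00},   // X, internal linkage, read-only data
        {"_Z1fv", 'U', 0},      // f(), must come from another object
        {"_Z2f2i", 'U', 0},     // f2(int), must come from another object
    };
    obj.relocations = {
        {0x05, "_Z1fv"},        // placeholder in g() for the call to f()
        {0x0f, "_Z2f2i"},       // placeholder in g() for the call to f2(int)
    };
    return obj;
}

// Lists every symbol roughly the way nm would: "<letter> <name>".
std::vector<std::string> nm_listing(const ObjectFile& obj) {
    std::vector<std::string> lines;
    for (const Symbol& s : obj.symbols) {
        lines.push_back(std::string(1, s.kind) + " " + s.name);
    }
    return lines;
}

// Names the linker still has to find somewhere else.
std::vector<std::string> undefined_symbols(const ObjectFile& obj) {
    std::vector<std::string> names;
    for (const Symbol& s : obj.symbols) {
        if (s.kind == 'U') names.push_back(s.name);
    }
    // In this model every placeholder refers to an undefined symbol.
    assert(names.size() == obj.relocations.size());
    return names;
}
```

Calling nm_listing(model_of_example()) gives one line per symbol, defined or not, which is the "both tables" view from the question. What nm prints for a real build depends on the compiler and flags.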
.a files are simply archives (an old format that predates tar, but does the same job). You could replace .a files with tar files, as long as you taught the linker to unpack them and link with the .o files they contain. There is a bit more logic than that, since the linker skips object files in the archive that aren't needed, but that's just an optimization.
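The "skip what isn't needed" logic can be sketched roughly like this. This is a simplified model, not how any particular linker is implemented: Member and select_members are hypothetical names, symbols are plain strings, and the real thing works from the archive's symbol index and cares about command-line order.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One .o file inside an archive, reduced to its symbol names.
struct Member {
    std::string name;
    std::set<std::string> defines;  // symbols this object provides
    std::set<std::string> needs;    // symbols this object leaves undefined
};

// Returns the names of the members that get linked, given the symbols
// that are still undefined before the archive is looked at.
std::vector<std::string> select_members(const std::vector<Member>& archive,
                                        std::set<std::string> undefined) {
    std::set<std::string> defined;
    std::vector<bool> taken(archive.size(), false);
    std::vector<std::string> linked;

    bool progress = true;
    while (progress) {  // rescan until a pass pulls in nothing new
        progress = false;
        for (std::size_t i = 0; i < archive.size(); ++i) {
            if (taken[i]) continue;

            // A member is only wanted if it defines something we lack.
            bool wanted = false;
            for (const std::string& sym : archive[i].defines) {
                if (undefined.count(sym)) { wanted = true; break; }
            }
            if (!wanted) continue;

            taken[i] = true;
            linked.push_back(archive[i].name);
            progress = true;

            for (const std::string& sym : archive[i].defines) {
                defined.insert(sym);
                undefined.erase(sym);
            }
            // Pulling in a member can introduce new undefined symbols.
            for (const std::string& sym : archive[i].needs) {
                if (!defined.count(sym)) undefined.insert(sym);
            }
        }
    }
    assert(linked.size() <= archive.size());
    return linked;
}
```

With an archive holding f.o (defines f, needs helper), unused.o and helper.o, and f as the only undefined symbol, this picks f.o and helper.o and leaves unused.o out.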
.so files are different. They are closer to a final binary than to an object file. An .so file with all symbols resolved can, at least theoretically, be run as a program. In fact, with PIE (position-independent executables) the difference between a shared library and a program is (at least in theory) just a few bits in the header. Both contain instructions telling the dynamic linker how to load them, and a relocation table telling it how to resolve the external symbols. All unresolved symbols in a dynamic library (and in a program) are accessed through indirection tables, which get populated at dynamic linking time (program start or dlopen).
If we simplify this a lot, the difference between objects and shared libraries is that much more work has been done in the shared library to avoid text relocations (this is not strictly necessary or enforced, but it's the general idea). In an object file, the assembler has only generated placeholders for addresses, which the linker then fills in. In a shared library, those addresses are instead filled in with addresses of jump table entries, so that the text of the library doesn't need to change, only a small jump table.
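Here is a small C++ simulation of that difference, in case it helps. Everything in it is invented for illustration (the type names, the use of long as a stand-in for an address, the function names); it doesn't touch real machine code. The point is only which array gets written to at link time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Address = long;  // stand-in for a machine address
using SymbolMap = std::map<std::string, Address>;

// Object-file style: the text itself holds placeholders.
struct Relocation {
    std::size_t text_offset;  // which text slot is a placeholder
    std::string symbol;       // what it should point at
};

struct ObjectImage {
    std::vector<Address> text;  // 0 marks an unfilled placeholder
    std::vector<Relocation> relocations;
};

// The (static) linker overwrites the placeholders inside the text.
void link_by_patching_text(ObjectImage& obj, const SymbolMap& addresses) {
    for (const Relocation& r : obj.relocations) {
        obj.text.at(r.text_offset) = addresses.at(r.symbol);
    }
}

// Shared-library style: the text refers to jump table slots instead.
struct SharedImage {
    std::vector<std::size_t> text;          // slot indices, fixed at build time
    std::vector<std::string> slot_symbols;  // symbol wanted by each slot
    std::vector<Address> jump_table;        // filled in at load time
};

// The dynamic linker only writes to the jump table; text stays untouched.
void link_by_filling_table(SharedImage& lib, const SymbolMap& addresses) {
    lib.jump_table.resize(lib.slot_symbols.size());
    for (std::size_t i = 0; i < lib.slot_symbols.size(); ++i) {
        lib.jump_table[i] = addresses.at(lib.slot_symbols[i]);
    }
}

// A call from the shared image costs one extra lookup through the table.
Address call_target(const SharedImage& lib, std::size_t call_index) {
    assert(!lib.jump_table.empty());  // the table must be populated first
    return lib.jump_table.at(lib.text.at(call_index));
}
```

After link_by_patching_text the text vector has changed; after link_by_filling_table it is byte-for-byte the same as before, which is what lets the text of a shared library be mapped read-only and shared between processes.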
By the way, I'm talking about ELF here. Older formats had more differences between programs and libraries.