linux compilation arm reverse-engineering

Does the .so file still contain infomation about label

In what phase of compilation does the compiler replace label into actual addr

I understanding instructions like jmp abc where abc is just a note and will be replace to actual address eventually, does it ?

Does the final .so file still contain infomation about label or the label is replace to actual addr when its load in the memory ?

Solution

TL;DR - your question is hard to answer, because it is mixing a few concepts. For typical assembler labels, we use PC relative and labels are resolve at assemble time. For other 'external' labels, there are many cases and the resolution depends on the case.

There are four conceptual ways to address on almost all CPUs, and definitely on the ARM.

PC relative address. Current instruction +/- offset.
Absolute address. This is the one you are conceptually thinking of.
Register computed address. Calculated at run time. ldr pc, [rn, #xx]
Table based addressing. Global offset table, etc. Much like registers computed addresses. ldr pc, [Rbase, Rindex, lsl #2]

The first two fit in a single instruction and are very efficient. The first is most desirable as the code can execute at ANY address as long as it maintains it's original layout (ie, you don't load it by splitting the code up).

In the table above, there is also the concept of 'build time' and 'run time'. The distinction is the difference between a linker and a loader. You have tagged this 'linux' and refer to an 'so' or shared library. Also, you are referring to assembler 'labels'. They are very similar concepts, but can be different as they will be one of the four classes of addressing above.

Most often in assembler, the labels are PC relative. There is no additional structure to be implemented with PC relative, except to keep the chunk of code continuous. In the case of an assembler that is a 'module' (compilation unit, for a compile) or is processed by the assembler and produced an 'object', it will use a PC relative addressing.

The object format can be annotate with external addresses and there are many choices in how an assembler may output these address. They are generally controlled by 'psuedo-ops'. That is a note (separate section with defined format) in the object file; the instruction is semi-complete in this form. It may prepare to use an offset table, use a register based compute (like r9+constant), etc.

For the typical case of linking (done at build time), we will either use PC relative or absolute. If we fix our binary to only run at one address, the assembler can setup for absolute addressing and resolve these through linking. In this case, the binary must be loaded at a fixed address. The assembler 'modules' or object files can be completely glued together to have everything resolved. Then there is no 'load' time fix ups. Other determining factor are whether code/data are separate, and whether the system is using an MMU. It is often desirable to keep code constant, so that many processes can use the same RAM/ROM pages, but they will have separate data. As well as memory efficient, this can provide some form of security (although it is not extremely robust) it will prevent accidental code overwrites and will provide debugging help in the form of SIGSEGV.

It is possible to write a PC-relative initialization routine which will do the fix-ups to create a table in your own binary. So a 'loader' is just to determine where you are running and then make calculations. For statically shared libraries, you typically know the libraries you will run, but not where they are. For dynamically shared libraries, you might not even know at compile time what the library is that you will run.

A Linux distribution can use either. If you have some sort of standard Linux Desktop distribution, (Ubuntu/Debian, Redhat, etc). You will have something base on ARM ELF LSB and dynamic shared libraries. You need to use the assembler pseudo ops to support this type of addressing or use a compiler to do it for you. The majority of all 'labels' in a shared library will be PC relative and not show up. Some labels can show up for debugging reasons (man strip) and some are absolutely needed to resolve addresses at run time.

I have also asked a question that I find related some time ago, Using GCC pre-processor as an assembler... So the key concept is that the assembler is generally 'two pass' and needs to do these local address fix ups. Then this question asks a 2nd level Concept A/B where we are adding shared libraries. The online book Linkers and Loaders is a great resource if you want to known more.