Tags: compilation, llvm, instrumentation, decompiler

Why do we need to instrument binaries when we have decompilers?


I'm studying binary instrumentation techniques, and I've found that many papers claim binary instrumentation is necessary when the source code is not available.

While we may not be able to recover the original source code, a decompiler (such as RetDec) can produce a semantically equivalent version, which, in my mind, is sufficient for many tasks previously done by binary instrumentation, e.g., software fault isolation. Sometimes we don't even have to decompile the binary all the way to source code -- LLVM IR is enough for many instrumentation and analysis tasks. And performance might even be better, since the middle-end optimizations can still run on the instrumented IR afterwards.

My guesses are that (1) the decompiler cannot recover the code well enough for most binary instrumentation tasks, (2) the decompiler can only decode a small portion of the binary, or (3) a decompiler takes a very long time to recover a big library, whereas binary instrumentation takes only a short time.

Is one of my guesses correct? What is actually the case here?

EDIT: Among the many binary instrumentation tasks, my focus is mainly on memory address isolation, which is usually done by masking the address or setting up a guard zone at the assembly level. I'm just curious why we can't add the checking code at the LLVM IR level if we can decompile the binary to that representation.
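
For concreteness, here is a minimal C sketch of the masking idea, the kind of check that could in principle be inserted at the IR level. The sandbox layout, size, and function name are invented for illustration and not taken from any particular SFI system:

```c
#include <stdint.h>
#include <stdio.h>

/* toy sandbox: a power-of-two data region (the size is made up) */
#define SANDBOX_SIZE 4096u
static uint8_t sandbox[SANDBOX_SIZE];

/* hypothetical rewritten store: the instrumenter replaces every raw
   store with this, masking the offset so any address the program
   computes stays inside the sandbox */
static void sandboxed_store(uint32_t offset, uint8_t value) {
    sandbox[offset & (SANDBOX_SIZE - 1)] = value;
}

int main(void) {
    sandboxed_store(10, 'A');        /* in bounds: lands at offset 10 */
    sandboxed_store(4096 + 10, 'B'); /* escape attempt: wraps to 10 */
    printf("%c\n", sandbox[10]);     /* prints 'B' */
    return 0;
}
```

Because the region size is a power of two, the check compiles down to a single AND on the address, which is why masking is the classic low-overhead SFI technique; at the LLVM IR level the same guard would be an `and` on the pointer's integer value before each load or store.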


Solution

  • Basically, the problem is that decompilers are "incomplete": they can't handle all possible binaries. Then too, with both decompilers and binary instrumentation, there's the problem of determining what in the binary is code and what is data -- it's generally undecidable, and you want to instrument only the code, not alter the data.

    With binary instrumentation, you can deal with this incrementally: instrument only what you know to be code, and place traps where execution might leave that known code so you can interrupt and instrument more (or, when something you thought was code is accessed as data, "undo" the instrumentation for that access). A toy sketch of this discover-code-as-you-execute approach appears below.

    As with everything, there are performance tradeoffs. The most extreme form of instrumentation is running the code under an emulator while extracting information, but its cost is high. Partial instrumentation, by inserting breakpoints or patching in code, has much lower cost but is less complete (see the breakpoint sketch at the end). Decompiling and recompiling may allow lower runtime cost, but at a higher up-front cost.
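
The incremental approach is what dynamic binary instrumentation engines such as Pin or DynamoRIO build on: redirect control flow through a dispatcher, and translate a block the first time it is about to run. Here is a toy sketch in C, with the function names and cache layout made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_SIZE 1024u

/* toy code cache: original address -> instrumented copy */
static struct { uintptr_t key; void *copy; } cache[CACHE_SIZE];

static void *translate_and_instrument(uintptr_t pc) {
    /* a real engine would decode the basic block at pc, insert the
       instrumentation, and emit a rewritten copy; we just log it */
    printf("first visit to %#lx: translating and instrumenting\n",
           (unsigned long)pc);
    return malloc(16); /* stand-in for the rewritten block */
}

/* every control transfer is redirected through this dispatcher */
static void *dispatch(uintptr_t pc) {
    size_t slot = (size_t)(pc / 16) % CACHE_SIZE;
    if (cache[slot].key != pc) {   /* miss: pc just proved itself code */
        cache[slot].key = pc;
        cache[slot].copy = translate_and_instrument(pc);
    }
    return cache[slot].copy;       /* execute the copy, not pc */
}

int main(void) {
    dispatch(0x401000);  /* unknown: translated on first execution */
    dispatch(0x401000);  /* known: cache hit, no extra work */
    dispatch(0x401010);  /* a newly discovered block */
    return 0;
}
```

The point is that an address only gets instrumented once execution actually reaches it, so the code-vs-data question never has to be answered statically.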
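
And here is the "inserting breakpoints" end of the spectrum in its rawest form: a toy ptrace sketch for Linux/x86-64 (not how any production tool is implemented). It patches an int3 over a known code address, catches the trap, then restores the original byte and resumes -- the same mechanism you would use to "undo" instrumentation:

```c
/* build: cc -o bp bp.c   (Linux/x86-64 only) */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

__attribute__((noinline)) void target(void) { puts("target ran"); }

int main(void) {
    pid_t child = fork();
    if (child == 0) {                        /* tracee */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);                      /* wait for the breakpoint */
        target();
        _exit(0);
    }

    int status;
    waitpid(child, &status, 0);              /* child stopped itself */

    /* instrument: save the original word, patch in int3 (0xCC) */
    uintptr_t addr = (uintptr_t)&target;
    long orig = ptrace(PTRACE_PEEKTEXT, child, (void *)addr, NULL);
    long patched = (orig & ~0xFFL) | 0xCC;
    ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)patched);
    ptrace(PTRACE_CONT, child, NULL, NULL);

    waitpid(child, &status, 0);              /* SIGTRAP: breakpoint hit */
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    printf("breakpoint hit at %#llx\n", (unsigned long long)(regs.rip - 1));

    /* "undo" the instrumentation: restore the byte, rewind, resume */
    ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)orig);
    regs.rip -= 1;
    ptrace(PTRACE_SETREGS, child, NULL, &regs);
    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);
    return 0;
}
```

Forking ourselves is a shortcut so the tracer knows the target's address without parsing the binary; instrumenting a foreign binary would need exactly the code-discovery machinery discussed above.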