Tags: gcc, assembly, arm, llvm, instrumentation

How to instrument specific assembly instructions and get their arguments


Given any C/C++ source.c compiled with gcc

int func()
{
    // bunch of code
    ...
}

will result in some assembly (example). . .

func():
  str fp, [sp, #-4]!
  add fp, sp, #0
  sub sp, sp, #12
  mov r3, #0
  str r3, [fp, #-8]
  mov r3, #55
  sub sp, fp, #0
  ldr fp, [sp], #4
  bx lr

. . . which eventually gets turned into a binary source.obj

What I want is the ability to specify: before each assembly instruction X, call my custom function and pass it the arguments of instruction X.

I'm really only interested in whether a given assembly instruction executes. If I say I care about mult, I'm not necessarily saying I care whether a multiplication occurred in the original source. I understand that multiply by 2^N will result in a shift instruction. I get it.

Let's say I specify mov as the asm of interest.

The resulting assembly would be changed to the following

func():
  str fp, [sp, #-4]!
  add fp, sp, #0
  sub sp, sp, #12
  // custom code inserted here:
  // I want to call another function with the arguments of **mov**
  mov r3, #0
  str r3, [fp, #-8]
  // custom code inserted here:
  // I want to call another function with the arguments of **mov**
  mov r3, #55 
  sub sp, fp, #0
  ldr fp, [sp], #4
  bx lr

I understand that the custom code may have to push/pop any registers it uses, depending on how much gcc "knows" about the registers it uses. The custom function may be a naked function.

WHY

To toggle a pin to do real-time profiling every time instruction X is executed.
To record every time the arguments of X meet certain criteria.


Solution

  • Your question is unclear, even with the additional edit. The -finstrument-functions option does not transform assembler code; it changes the way the compiler works during optimization and code generation. It works on intermediate compiler representations, probably at the GIMPLE level, not at the assembler or RTL level.

    Perhaps you could code some GCC plugin which would work at the GIMPLE level, by adding an optimization pass that transforms the appropriate GIMPLE (BTW, the -finstrument-functions option adds more passes). This could take months of work, since you need to understand the internals of GCC, and you would be adding your own instrumentation-generating pass to the compiler.
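
    To make the granularity of -finstrument-functions concrete, here is a minimal sketch of the two hooks GCC documents for that option. The counters and the body of func are my own illustrative assumptions; only the hook names and the no_instrument_function attribute come from GCC.

    ```c
    /* Hypothetical counters; the names are illustrative, not part of GCC. */
    static unsigned long enter_count;
    static unsigned long exit_count;

    /* With -finstrument-functions, GCC calls this on entry to every
       instrumented function. The attribute keeps the hook itself from
       being instrumented, which would otherwise recurse forever. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *this_fn, void *call_site)
    {
        (void)this_fn;
        (void)call_site;
        enter_count++;
    }

    /* Called just before every instrumented function returns. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *this_fn, void *call_site)
    {
        (void)this_fn;
        (void)call_site;
        exit_count++;
    }

    /* Stand-in for the func() of the question. */
    int func(void)
    {
        int x = 0;
        x = 55;
        return x;
    }
    ```

    Built with gcc -finstrument-functions, each call to func should bump both counters once. Notice that the hooks fire per function, never per instruction, which is why this option does not give you what you ask for.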

    Perhaps you are using some asm in your code. Then you might use some preprocessor macro to insert some code around it.
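
    A minimal sketch of that macro idea, assuming your asm statements are hand-written and take no operands. The names INSTRUMENTED_ASM, before_asm, asm_hits and demo are hypothetical, and "nop" is used only because both ARM and x86 assemblers accept it.

    ```c
    /* Hypothetical hook state; a real hook might toggle a GPIO pin instead. */
    static unsigned long asm_hits;
    static const char *last_asm_text;

    /* Called before each wrapped asm statement. */
    static void before_asm(const char *text)
    {
        last_asm_text = text;
        asm_hits++;
    }

    /* Run the hook, then the hand-written instruction.
       `text` must be a string literal, since it becomes the asm template. */
    #define INSTRUMENTED_ASM(text)       \
        do {                             \
            before_asm(text);            \
            __asm__ volatile (text);     \
        } while (0)

    void demo(void)
    {
        INSTRUMENTED_ASM("nop");
        INSTRUMENTED_ASM("nop");
    }
    ```

    This only covers instructions you wrote yourself inside asm statements. It does nothing for the instructions the compiler generates from ordinary C code.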

    Perhaps you want to change your ABI or calling conventions (or the way GCC is generating assembler code). Then you need to patch the compiler itself (and implement a new target in it). This might require more than a year of work.

    Be aware of various optimizations done by GCC. Sometimes you might want volatile asm instead of just asm.
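
    As a small illustration of the difference (a sketch; the function names are hypothetical), both functions below pass a value through an empty asm template. GCC is allowed to delete the first asm when its output ends up unused, while the volatile one must be kept.

    ```c
    /* Plain asm with an output operand: the optimizer may remove it
       if `out` is never used, or move it around. */
    int pass_plain(int x)
    {
        int out;
        __asm__ ("" : "=r"(out) : "0"(x));
        return out;
    }

    /* volatile asm: GCC must keep the statement even if `out` is unused. */
    int pass_volatile(int x)
    {
        int out;
        __asm__ volatile ("" : "=r"(out) : "0"(x));
        return out;
    }
    ```

    The "0" constraint ties the input to the same register as the output, so with an empty template both functions simply return their argument.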

    My documentation page of GCC MELT gives many slides and links which should help you.

    Is it possible to do this with any compiler?

    Both GCC and Clang are free software, so you can study their source code and improve it for your needs. But both are very complex (many millions of lines of source code), and you'll need several years of work to fork them. By the time you finished, they would have evolved significantly.

    what I’d like to do is choose a set of assembly instructions - like { add, jump } - and tell the compiler to insert a snippet of my own custom assembly code just before any instruction in that set

    You should read some book on compilers (e.g. the Dragon Book) and another book on Instruction Set Architecture and Computer Architecture. You can't just insert arbitrary instructions into the assembler code generated by the compiler, because what you insert requires processor resources that the compiler has already managed (e.g. through register allocation).

    After your edit

    // I want to call another function with the arguments of mov

     mov r3, #0
    

    This is not possible (or very difficult) in general, because calling that other function will use r3 and spoil its content.

    gcc -c source.c -o source.obj

    is the wrong way to use GCC. You want optimization (especially for production binaries). If you care about assembler code, use gcc -O -Wall -fverbose-asm -S source.c (perhaps -O2 -march=native instead of -O ...) then look into source.s

    Let's say I specify mul as the asm of interest.

    Again, that is the wrong approach. You care about multiplication in the source code, or in some intermediate representation. Perhaps mul might be emitted for x*3 without -O but probably not with -O2
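
    A tiny test file makes this easy to check yourself (a sketch; the function names are hypothetical). Compile it once with gcc -O0 -S and once with gcc -O2 -S, then compare the two .s files. Which instructions you get depends on the target and the GCC version, so I am not claiming any particular output here.

    ```c
    /* Multiplication by a small odd constant. */
    int times3(int x)
    {
        return x * 3;
    }

    /* Multiplication by a power of two. */
    int times8(int x)
    {
        return x * 8;
    }
    ```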

    Think and work at the GIMPLE level, not at the assembler level.

    examples

    First, look into the source code of GCC. It is free software. If you want to understand how -finstrument-functions really works, take a few months to read about GCC internals (I gave links and references), study the actual source code of GCC, and ask on gcc@gcc.gnu.org after that.

    Now, imagine you want to count and instrument how many multiplications are done (which is not the same as how many IMUL instructions are executed, e.g. because 8*x will probably be optimized into a shift machine instruction). Of course it depends upon the optimizations enabled, and you'll work at the GIMPLE level. You'll probably increment some counter at the end of every GCC basic block. So after each BB exit you'll insert an additional GIMPLE statement. Even such a simple instrumentation could need months of work.

    Or imagine that you want to instrument loads to detect, when possible, undefined behavior or addressing issues. This is what the address sanitizer is doing. It took several years of work.
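
    To picture what such a pass would produce, here is a hand-written C emulation of the result (a sketch only; this is not GIMPLE and not plugin code, and the names mul_count and dot are hypothetical). The counter update sits at the end of the basic block that contains the multiplication.

    ```c
    /* Hypothetical global counter that the instrumentation pass would maintain. */
    static unsigned long mul_count;

    /* Dot product; the loop body is one basic block with one multiplication. */
    long dot(const int *a, const int *b, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long)a[i] * b[i];
            /* Statement the pass would insert at the end of this block:
               the block contains exactly one multiplication. */
            mul_count += 1;
        }
        return sum;
    }
    ```

    A real pass would insert the equivalent statement into the intermediate representation, so the count reflects multiplications in the program, independent of which machine instruction is finally chosen.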

    Things are much more complex than what you believe.

    (It is not in vain that GCC has about ten million lines of source code; C compilers need to be complex today.)

    If you don't care about the C source code, you should not care about GCC. The assembler code could be produced by Bones, by Clang, by a JVM implementation, by ocamlopt, etc. (and none of these even use GCC). Or it could be produced by some other version of GCC (not the one you are instrumenting).

    So spend a few weeks reading more about compilers, then ask another question. That question should mention what kind of binary or assembler code you want to instrument. Instrumenting assembler code (or a binary executable) is a lot harder than instrumenting inside GCC. Binary instrumentation tools don't use textual techniques at all: they first extract an abstracted form of the control flow graph, then refine it and reason on it.

    BTW, you'll find lots of textbooks and conferences on both source instrumentation and binary instrumentation (these topics are different, even if related). Spend a few months reading them. Your naive textual approach has a 1960s smell; it won't scale and won't work on today's software.

    See also this talk (and video): Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” CppCon 2017