function assembly call reverse-engineering

How many arguments are passed in a function call?

I wish to analyze assembly code that calls functions, and for each 'call' find out how many arguments are passed to the function. I assume that the target functions are not accessible to me, but only the calling code. I limit myself to code that was compiled with GCC only, and to System V ABI calling convention. I tried scanning back from each 'call' instruction, but I failed to find a good enough convention (e.g., where to stop scanning? what happen on two subsequent calls with the same arguments?). Assistance is highly appreciated.

Solution

Reposting my comments as an answer.

You can't reliably tell in optimized code. And even doing a good job most of the time probably requires human-level AI. e.g. did a function leave a value in RSI because it's a second argument, or was it just using RSI as a scratch register while computing a value for RDI (the first argument)? As Ross says, gcc-generated code for stack-args calling-conventions have more obvious patterns, but still nothing easy to detect.

It's also potentially hard to tell the difference between stores that spill locals to the stack vs. stores that store args to the stack (since gcc can and does use mov stores for stack-args sometimes: see -maccumulate-outgoing-args). One way to tell the difference is that locals will be reloaded later, but args are always assumed to be clobbered.

what happen on two subsequent calls with the same arguments?

Compilers always re-write args before making another call, because they assume that functions clobber their args (even on the stack). The ABI says that functions "own" their args. Compilers do make code that does this (see comments), but compiler-generated code isn't always willing to re-purpose the stack memory holding its args for storing completely different args in order to enable tail-call optimization. :( This is hand-wavey because I don't remember exactly what I've seen as far as missed tail-call optimization opportunities.

Yet if arguments are passed by the stack, then it shall probably be the easier case (and I conclude that all 6 registers are used as well).

Even that isn't reliable. The System V x86-64 ABI is not simple.

int foo(int, big_struct, int) would pass the two integer args in regs, but pass the big struct by value on the stack. FP args are also a major complication. You can't conclude that seeing stuff on the stack means that all 6 integer arg-passing slots are used.

The Windows x64 ABI is significantly different: For example, if the 2nd arg (after adding a hidden return-value pointer if needed) is integer/pointer, it always goes in RDX, regardless of whether the first arg went in RCX, XMM0, or on the stack. It also requires the caller to leave "shadow space".

So you might be able to come up with some heuristics to will work ok for un-optimized code. Even that will be hard to get right.

For optimized code generated by different compilers, I think it would be more work to implement anything even close to useful than you'd ever save by having it.