I'm writing a compiler from .NET CIL code to some high level language. Process is similar to decompilation. I have done some control flow analysis - detecting loops, ifs, and so on. In terms of data flow analysis i've done simple expression propagation by simulating some instructions involving evaluation stack - I treat evaluation stack as hidden variable, push more complex expressions on it, and whenever there is any assignment instruction to some variable (for example starg
or stloc
) - I pop and assign propagated expression from stack to this variable and translate this expression into statement in high language code. Of course for now it is so simple that it generates failures. Consider a function written in C#:
int GCD(int n1, int n2)
{
while(n2 != 0)
{
int c = n1 % n2;
n1 = n2;
n2 = c;
}
return n1;
}
This function compiles to IL:
.method private hidebysig
instance int32 GCD (
int32 n1,
int32 n2
) cil managed
{
.maxstack 8
IL_0000: br.s IL_000a
IL_0002: ldarg.1 // load n1 on eval stack
IL_0003: ldarg.2 // load n2 on eval stack
IL_0004: rem // pop n1 and n2 from stack, compute n1 % n2 and push it on stack
IL_0005: ldarg.2 // load n2 on stack
IL_0006: starg.s n1 // pop n2 from stack and store it in n1
IL_0008: starg.s n2 // pop n1 % n2 from stack and store it in n2
IL_000a: ldarg.2
IL_000b: brtrue.s IL_0002
IL_000d: ldarg.1
IL_000e: ret
}
With this simple propagation we push n1 % n2
on stack, then load n2
on stack, then we have starg
instruction, so we pop expression from stack and translate assignment to statement. Then we pop again, and do the same. Result code looks like this:
GCD(n1, n2) {
while (n2) {
n1 = n2;
n2 = (n1 % n2);
}
return n1;
}
Translated code has changed semantics and isn't doing what it is supposed to do, because n1
is changed before computing n1 % n2
. There is a need to store intermediate result of n1 % n2
in some place, for example local variable, like in original C# code. This indicates that I have to do something inverse to dead code elimination, maybe called like "necessary code introduction". I searched for some sources about methods to introduce new variables in decompilation, but I did not find any. Do you have any ideas?
You may try to build SSA form from bytecode. SSA form is used by most modern compilers and decompilers as an intermediate representation.
In SSA form every assignment to the variable will produce new variable so you will be always able to refer the correct value.
Here is algorithm description for building SSA: https://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
After that you will need to return back to the non-SSA form. Algorithms can be found in still unfinished SSA Book: http://ssabook.gforge.inria.fr/latest/book.pdf
Look for chapters 3.2 and 17.