Search code examples
compilationcompiler-optimizationdecompilingcildecompiler

Decompilation of CIL code into some high level code - do I need to introduce new variables during data flow analysis?


I'm writing a compiler from .NET CIL code to some high level language. Process is similar to decompilation. I have done some control flow analysis - detecting loops, ifs, and so on. In terms of data flow analysis i've done simple expression propagation by simulating some instructions involving evaluation stack - I treat evaluation stack as hidden variable, push more complex expressions on it, and whenever there is any assignment instruction to some variable (for example starg or stloc) - I pop and assign propagated expression from stack to this variable and translate this expression into statement in high language code. Of course for now it is so simple that it generates failures. Consider a function written in C#:

int GCD(int n1, int n2)
{
    while(n2 != 0)
    {
        int c = n1 % n2;
        n1 = n2;
        n2 = c;
    }

    return n1;
}

This function compiles to IL:

.method private hidebysig 
    instance int32 GCD (
        int32 n1,
        int32 n2
    ) cil managed 
{
    .maxstack 8

    IL_0000: br.s IL_000a
        IL_0002: ldarg.1    // load n1 on eval stack
        IL_0003: ldarg.2    // load n2 on eval stack
        IL_0004: rem        // pop n1 and n2 from stack, compute n1 % n2 and push it on stack
        IL_0005: ldarg.2    // load n2 on stack
        IL_0006: starg.s n1 // pop n2 from stack and store it in n1
        IL_0008: starg.s n2 // pop n1 % n2 from stack and store it in n2

        IL_000a: ldarg.2
        IL_000b: brtrue.s IL_0002

    IL_000d: ldarg.1
    IL_000e: ret
}

With this simple propagation we push n1 % n2 on stack, then load n2 on stack, then we have starg instruction, so we pop expression from stack and translate assignment to statement. Then we pop again, and do the same. Result code looks like this:

GCD(n1, n2) {
    while (n2) { 
        n1 = n2;
        n2 = (n1 % n2); 
    }
    return n1;
}

Translated code has changed semantics and isn't doing what it is supposed to do, because n1 is changed before computing n1 % n2. There is a need to store intermediate result of n1 % n2 in some place, for example local variable, like in original C# code. This indicates that I have to do something inverse to dead code elimination, maybe called like "necessary code introduction". I searched for some sources about methods to introduce new variables in decompilation, but I did not find any. Do you have any ideas?


Solution

  • You may try to build SSA form from bytecode. SSA form is used by most modern compilers and decompilers as an intermediate representation.

    In SSA form every assignment to the variable will produce new variable so you will be always able to refer the correct value.

    Here is algorithm description for building SSA: https://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf

    After that you will need to return back to the non-SSA form. Algorithms can be found in still unfinished SSA Book: http://ssabook.gforge.inria.fr/latest/book.pdf

    Look for chapters 3.2 and 17.