Search code examples
decompiler

Decompiler - how to structure loops


I'm writing a decompiler for a simple scripting language. Here is what I've done:

Basic blocks

Created a collection of basic blocks as described here:

http://www.backerstreet.com/decompiler/basic_blocks.php

Control flow graph, dominator tree and loop sets

From this I could create the control flow graph.

http://www.backerstreet.com/decompiler/control_flow_graph.php

From the CFG I created the dominator tree, which in turn allowed me to find the loop sets in the CFG as described here:

http://www.backerstreet.com/decompiler/loop_analysis.php

Here is an image containing all of my data so far:

Control flow diagram

Structuring loops

My next step should be:

http://www.backerstreet.com/decompiler/creating_statements.php

And this is where my question comes in because I'm totally stuck. In my given data how would the structuring loops algorithm apply? I don't understand why it starts by trying to structure everything as a do while loop - in my example this means that "temp5_14 = temp5_14 + 16" in block 3 would always get executed at least once which is not what the original code does at all.

How could that work and how would the next pass of converting it from a do-while to a while loop actually work? For loop 3 that ends in block 6 this looks like it should be a while(true) - but how would this work with the algorithm when its head block is an if statement?

TL;DR - someone please explain with examples how the "structuring loops" algorithm actually works.


Solution

  • Structuring is the hardest part of decompiler development (at least for high level languages). That's a fairly simplistic algorithm, so it's a good starting point, but you'll likely want to use a better algorithm or make your own if you're working on a real decompiler.

    With that out of the way, the answer to your actual question about how it can use do-while loops instead of while loops is already answered on the page you linked to.

    Every loop can be described with a "do-while" statement.

    The "while" loop (pre-tested loop) is a special case of a "do-while" loop, where the bottom condition is always true, and the first statement of the loop is an "if" that jumps out of the loop.

    Say you had something like

    beforeloop
    while(foo) {
    stmt1
    stmt2
    }
    afterloop
    

    It would be compiled to something along the lines of

    beforeloop
    
    LOOPBEGIN:
    if !foo goto LOOPEND
    stmt1
    stmt2
    goto LOOPBEGIN
    
    LOOPEND:
    afterloop
    

    The decompiler algorithm converts this to

    beforeloop
    do {
    if (!foo) {break}
    stmt1
    stmt2
    } while (true)
    afterloop
    

    I hope that cleared that up. If not, feel free to ask about any other questions.

    Edit: Example 2, showing how multiple loops with the same entry point can be collapsed.

    for(;;) { while(foo) {} while(bar){} }
    

    First off, for(;;) is equivalent to while(true), so I'll use the following (pseudo)code instead

    while(true) { while(foo) {stmt1} while(bar){stmt2} }
    

    Let the outer loop be loop A and the inner loops be loop B and C. This compiles to something like the following pseudo assembly.

    LOOP_A_BEGIN:
    LOOP_B_BEGIN:
    if !foo goto LOOP_B_END
    stmt1
    goto LOOP_B_BEGIN
    LOOP_B_END:
    LOOP_C_BEGIN:
    if !bar goto LOOP_C_END
    stmt2
    goto LOOP_C_BEGIN
    LOOP_C_END:
    goto LOOP_A_BEGIN
    

    But of course labels don't take up any space. So with identical labels collapsed, it becomes

    POINT1:
    if !foo goto POINT2
    stmt1
    goto POINT1
    POINT2:
    if !bar goto POINT3
    stmt2
    goto POINT2
    POINT3
    goto POINT1
    

    Now, there are two points with backedges - point 1 and point 2. We can create one loop for each node, using labeled breaks for clarity. The transform isn't quite as straightforward since you have to mess with the if statements a bit, but it's still pretty easy.

    LOOP1: while(true) {
        IF1: if (!foo) {
            break IF1;
        }
        else {
            stmt1;
            continue LOOP1;
        }
    
        LOOP2: while(true) {
            if (!bar) {
                break LOOP2;
            }
            else {
                stmt2;
                continue LOOP2;
            }
        }
    
        continue LOOP1;
    }
    

    Now, the same code with unnecessary labels simplified out

    while(true) {
        if (!foo) {
        }
        else {
            stmt1;
            continue;
        }
    
        while(true) {
            if (!bar) {
                break;
            }
            else {
                stmt2;
            }
        }
    }
    

    Now with if statements simplified

    while(true) {
        if (foo) {
            stmt1;
            continue;
        }
    
        while(true) {
            if (!bar) {
                break;
            }
            stmt2;
        }
    }
    

    And finally, you can apply the while(true) if(!x) transform to the inner loop. The outer loop can't be transformed like this, since it's not a simple while(cond) loop due to being the result of merged loops.

    while(true) {
        if (foo) {
            stmt1;
            continue;
        }
    
        while(bar) {
            stmt2;
        }
    }
    

    So hopefully, this demonstrates how you can always handle the case of multiple loops with the same entry point by merging them into a single loop, at the possible expense of having to rearrange some if statements too.