Decompiler - how to structure loops

I'm writing a decompiler for a simple scripting language. Here is what I've done:

Basic blocks

Created a collection of basic blocks as described here:

http://www.backerstreet.com/decompiler/basic_blocks.php

Control flow graph, dominator tree and loop sets

From this I could create the control flow graph.

http://www.backerstreet.com/decompiler/control_flow_graph.php

From the CFG I created the dominator tree, which in turn allowed me to find the loop sets in the CFG as described here:

http://www.backerstreet.com/decompiler/loop_analysis.php

Here is an image containing all of my data so far:

Control flow diagram

Structuring loops

My next step should be:

http://www.backerstreet.com/decompiler/creating_statements.php

And this is where my question comes in because I'm totally stuck. In my given data how would the structuring loops algorithm apply? I don't understand why it starts by trying to structure everything as a do while loop - in my example this means that "temp5_14 = temp5_14 + 16" in block 3 would always get executed at least once which is not what the original code does at all.

How could that work and how would the next pass of converting it from a do-while to a while loop actually work? For loop 3 that ends in block 6 this looks like it should be a while(true) - but how would this work with the algorithm when its head block is an if statement?

TL;DR - someone please explain with examples how the "structuring loops" algorithm actually works.

Solution

Structuring is the hardest part of decompiler development (at least for high level languages). That's a fairly simplistic algorithm, so it's a good starting point, but you'll likely want to use a better algorithm or make your own if you're working on a real decompiler.

With that out of the way, the answer to your actual question about how it can use do-while loops instead of while loops is already answered on the page you linked to.

Every loop can be described with a "do-while" statement.

The "while" loop (pre-tested loop) is a special case of a "do-while" loop, where the bottom condition is always true, and the first statement of the loop is an "if" that jumps out of the loop.

Say you had something like

beforeloop
while(foo) {
stmt1
stmt2
}
afterloop

It would be compiled to something along the lines of

beforeloop

LOOPBEGIN:
if !foo goto LOOPEND
stmt1
stmt2
goto LOOPBEGIN

LOOPEND:
afterloop

The decompiler algorithm converts this to

beforeloop
do {
if (!foo) {break}
stmt1
stmt2
} while (true)
afterloop

I hope that cleared that up. If not, feel free to ask about any other questions.

Edit: Example 2, showing how multiple loops with the same entry point can be collapsed.

for(;;) { while(foo) {} while(bar){} }

First off, for(;;) is equivalent to while(true), so I'll use the following (pseudo)code instead

while(true) { while(foo) {stmt1} while(bar){stmt2} }

Let the outer loop be loop A and the inner loops be loop B and C. This compiles to something like the following pseudo assembly.

LOOP_A_BEGIN:
LOOP_B_BEGIN:
if !foo goto LOOP_B_END
stmt1
goto LOOP_B_BEGIN
LOOP_B_END:
LOOP_C_BEGIN:
if !bar goto LOOP_C_END
stmt2
goto LOOP_C_BEGIN
LOOP_C_END:
goto LOOP_A_BEGIN

But of course labels don't take up any space. So with identical labels collapsed, it becomes

POINT1:
if !foo goto POINT2
stmt1
goto POINT1
POINT2:
if !bar goto POINT3
stmt2
goto POINT2
POINT3
goto POINT1

Now, there are two points with backedges - point 1 and point 2. We can create one loop for each node, using labeled breaks for clarity. The transform isn't quite as straightforward since you have to mess with the if statements a bit, but it's still pretty easy.

LOOP1: while(true) {
    IF1: if (!foo) {
        break IF1;
    }
    else {
        stmt1;
        continue LOOP1;
    }

    LOOP2: while(true) {
        if (!bar) {
            break LOOP2;
        }
        else {
            stmt2;
            continue LOOP2;
        }
    }

    continue LOOP1;
}

Now, the same code with unnecessary labels simplified out

while(true) {
    if (!foo) {
    }
    else {
        stmt1;
        continue;
    }

    while(true) {
        if (!bar) {
            break;
        }
        else {
            stmt2;
        }
    }
}

Now with if statements simplified

while(true) {
    if (foo) {
        stmt1;
        continue;
    }

    while(true) {
        if (!bar) {
            break;
        }
        stmt2;
    }
}

And finally, you can apply the while(true) if(!x) transform to the inner loop. The outer loop can't be transformed like this, since it's not a simple while(cond) loop due to being the result of merged loops.

while(true) {
    if (foo) {
        stmt1;
        continue;
    }

    while(bar) {
        stmt2;
    }
}

So hopefully, this demonstrates how you can always handle the case of multiple loops with the same entry point by merging them into a single loop, at the possible expense of having to rearrange some if statements too.