Live-variable analysis in functional style

I'm trying to implement live variable analysis in OCaml. I already have the control flow graph built, but I'm having trouble thinking of ways to compute the dataflow equations in a purely functional style. Directly implementing the equations leads to infinite recursion. On the other doing it iteratively would require side effects to reassign the sets.

Minimum (nonworking) example: This is pretty much a direct implementation of the dataflow equations.

open List

(*                  successors  gen set    kill set *)
type node = Node of node list * int list * int list

(* variables *)
let v1 = 1
let v2 = 2
let v3 = 3

(* nodes *)
let rec n1 = Node ([n2], [], [v1])
    and n2 = Node ([n3], [], [v3])
    and n3 = Node ([n4], [v1], [])
    and n4 = Node ([n5], [v1; v3], [v2])
    and n5 = Node ([n1], [v2], [])

let unique = sort_uniq (fun a b -> a - b)

(* live_in[n] = gen[n] union (out[n] - kill[n]) *)
let rec live_in n = match n with Node (_, gen, kill) -> unique (gen @
    filter (fun v -> not (mem v kill)) (live_out n))

(* live_out[n] = union over p in succ[n] in[p] *)
    and live_out (Node (succ, _, _)) = unique (concat_map live_in succ)

let x = live_in n1  (* infinite recursion *)

Solution

Any program, no matter if it is written in a functional style or in an imperative, operates with a program state. The only difference between the two styles is that in an imperative program, the state is mutated in place, and in functional, each program step produces a fresh new instance of state.

Otherwise, the end goal of every program is to produce a new state after a series of iterations (or keep producing new states, if the program is running in an infinite loop, like a web server).

The first problem with your program is that it lacks any control-flow statements, so it is destined to run forever (or until it exhausts system resources).

When you develop a recursive algorithm, you should ask yourself questions like what is the stop condition and how is the state changes on each step, so that it guarantees that the stop condition will be eventually reached, e.g.,

let rec sum n = 
  if n <= 0 then 0 else 1 + sum (n - 1)

Here we see the stop condition is n <= 0 and on each step the state, which is represented by an integer n changes, in our case it is reduced by one, so it guarantees that we will eventually hit the stop condition n <= 0.

In liveness analysis, the state is the set of live variables indexed by the node, e.g., L(i) is the set of variables that are alive at the exit of node i. For a graph with n nodes the state will be an indexed family {L(1),...,L(n)}.

To compute this state, you start with the initial approximation that all variables are dead, e.g, {(),(),...,()} Then you visit each node of your graph and propagate the live variables from the exit of this node to the exits of its predecessors (it is the backward analysis). Your stop condition is when the state stabilizes, i.e., on step N we have the same sets of live variables as we had on step N-1. On every node, we propagate (transfer) live variables to the predecessors. For that, we take live variables on our exit (which are either empty or were computed on the previous steps), remove all variables that are defined in this node (aka the KILL set), and add all variables that this node uses (aka the GEN set). The GEN and KILL sets are static, i.e., they do not change as we iterate.

After you visited every node in the graph, you have a partial solution. With each iteration (and every iteration, in turn, visits all the nodes), you refine this solution. At some point of time, you will get the same solution, which will indicate that you have reached the fixed point. This solution will be the correct answer to the liveness problem.

Now back to your approach. It has two fundamental problems (other than it doesn't even try to stop). First, you don't have access to the whole state, i.e., free variables indexed by nodes (you can't make a decision to stop based only on a transfer through a single node, you have to be sure that all nodes are stable). Next, your recursive definition defines the final solution not a recursive step that is passed to the fixed point function. Your definition will fit nicely into logical languages like Prolog and Datalog, which have a built-in fixed point computation, but OCaml is a general-purpose language, so you need to define your iterations manually.

So you need to define a state, you can use, of course, a list of lists, but it is better to use maps and sets, e.g.,

module Var = Int
module Node = Int

type state = Set.Make(Var).t Map.Make(Int).t

Next, you need to iterate over every node, visiting N nodes on each step, where N is the number of nodes. I.e., instead of computing your graph, and this is what you're doing, you are following the infinite number of paths in your graph, you need to walk through the nodes, each taking once per step. The order of iterations doesn't matter from the perspective of correctness but matters from the perspective of the performance (hint, you shall start from the exit block since it is a backward analysis). And last but not least, you should follow predecessors, not successors, since, again, it is a backward analysis.

Finally, of course, in real life, this is not how dataflow problems are solved in OCaml. First of all, the graph structures are rarely static in the sense that you can encode them as an OCaml term. You probably don't want to recompile your OCaml program for every new input (if you really want it, then try MetaOCaml). So in real life, we either use existing libraries for graphs, e.g., graphlib, ocamlgraph or define them as mappings. Next, we compute fixed points not by running the whole computational and comparing results, but by employing a worklist algorithm, like Kildall's method, e.g., here's how it is implemented in graphlib. And if you wonder how it looks all together here is the implementation of the liveness analysis (corresponding documentation) in the Binary Analysis Platform (BAP).