Search code examples
pythoncstatic-analysiscontrol-flow-graphpycparser

Finding the input dependencies of a functions outputs


I've been working on a python program with pycparser which is supposed to generate a JSON-file with dependencies of a given function and its outputs. For an example function:

int Test(int testInput)
{
    int b = testInput;
    return b;
}

Here I would expect b to be dependent on testInput. But ofcourse it can get a lot more complicated with structs and if-statements etc. The files I'm testing also have functions in a specific form that are considered inputs and outputs as in:

int Test(int testInput)
{
    int anotherInput = DatabaseRead(VariableInDatabase);
    int b = testInput;
    int c;

    c = anotherInput + 1;
    DatabaseWrite(c);
    return b;
}

Here c would be dependent on VariableInDatabase, and b same as before. I've run into a wall with this analysis in pycparser as mostly structs and pointers are really hard for me to handle, and it seems like there'd be a better way. I've read into ASTs and CFGs, and other analysis tools like Frama-C but I can't seem to find a clear answer if this is even a thing.

Is there a known way to do this kind of analysis, and if so, what should I be looking into? It's supposed to thousands of files and be able to output these dependencies into a JSON, so plugins for editors doesn't seem like what I'm looking for.


Solution

  • You need data flow analysis of your code, and then you want to follow the data flow backwards from a result to its sources, up to some stopping point (in your case, you stopped at a function parameter but you probably also want to stop at any global variable).

    This is called program slicing in the literature.

    Computing data flows is pretty hard, especially if you have a complex language (C is fun: you can have data flows through indirectly called functions that read values; now you need indirect points-to analysis to support your data flow, and vice versa).

    Here's fun example:

     // ocean of functions:
     ...
     int a(){ return b; }
     ...
     int p(){ return q; }
     ...
      void foo( int()* x )
     {  return (*x)(); }
    

    Does foo depend on b? on q? You can't know unless you know that foo calls a or b. But foo is handed a function pointer... and what might that point to?

    Using just ASTs and CFGs is necessary but not sufficient; data flow analysis algorithms are hard, especially if you have scale (as you suggest you do); you need a lot of machinery to do this that is not easy to build [We've done this on C programs of 16 million lines]. See my essay on Life After Parsing.