python r language-agnostic profiling parse-tree

Does a line profiler for code require a parse tree and is that sufficient?

I am trying to determine what is necessary to write a line profiler for a language, like those available for Python and Matlab.

A naive way to interpret "line profiler" is to assume that one can insert time logging around every line, but the definition of a line is dependent on how a parser handles whitespace, which is only the first problem. It seems that one needs to use the parse tree and insert timings around individual nodes.

Is this conclusion correct? Does a line profiler require the parse tree, and is that all that is needed (beyond time logging)?

Update 1: Offering a bounty on this because the question is still unresolved.

Update 2: Here is a link for a well-known Python line profiler in case it is helpful for answering this question. I've not yet been able to make heads or tails of its behavior relative to parsing. I'm afraid that the code for the Matlab profiler is not accessible.

Also note that one could say that manually decorating the input code would eliminate a need for a parse tree, but that's not an automatic profiler.

Update 3: Although this question is language agnostic, this arose because I am thinking of creating such a tool for R (unless it exists and I haven't found it).

Update 4: Regarding use of a line profiler versus a call stack profiler - this post relating to using a call stack profiler (Rprof() in this case) exemplifies why it can be painful to work with the call stack rather than directly analyze things via a line profiler.

Solution

I'd say that yes, you require a parse tree (and the source) - how else would you know what constitutes a "line" and a valid statement?

A practical simplification, though, might be a "statement profiler" instead of a "line profiler". In R, the parse tree is readily available via body(theFunction), so it should be fairly easy to insert measuring code around each statement. With some more work you can insert it around a group of statements that belong to the same line.
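A rough sketch of that statement-profiler idea follows. The helper names make_statement_profiler and timed are made up for illustration. It assumes the function body is a { block and only times the top-level statements, so treat it as a starting point rather than a finished tool. Each statement is passed as a lazily evaluated argument to timed, so it still runs in the function's own frame and assignments behave as before.

```r
# Hypothetical helper: wrap every top-level statement of fn in a timer.
# Assumes body(fn) is a `{` block.
make_statement_profiler <- function(fn) {
  stmts   <- as.list(body(fn))[-1]        # drop the leading `{`
  elapsed <- numeric(length(stmts))       # accumulated seconds per statement

  # expr is a promise, so forcing it evaluates the statement in the
  # caller's frame (the profiled function), not in timed's frame.
  timed <- function(i, expr) {
    t0 <- proc.time()[["elapsed"]]
    on.exit(elapsed[i] <<- elapsed[i] + proc.time()[["elapsed"]] - t0)
    expr
  }

  # Build calls of the form timed(i, <statement>) and reassemble the body.
  wrapped <- lapply(seq_along(stmts), function(i) {
    as.call(list(timed, i, stmts[[i]]))
  })
  body(fn) <- as.call(c(as.name("{"), wrapped))

  list(
    fn         = fn,
    timings    = function() elapsed,
    statements = vapply(stmts, function(s) paste(deparse(s), collapse = " "),
                        character(1))
  )
}

# Usage: profile a small function and look at the per-statement timings.
f <- function(x, y = 3) {
  a <- 0; a <- 1
  a <- (x + 1) * (y + 2)
  a
}
p <- make_statement_profiler(f)
result <- p$fn(1)
data.frame(statement = p$statements, seconds = p$timings())
```

Statements nested inside loops or if blocks are not instrumented here; doing that would mean walking the parse tree recursively.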

In R, the body of a function loaded from a file typically also has a srcref attribute that records the source location of each "line" (actually each statement):

Here's a sample function (put in "example.R"):

f <- function(x, y=3)
{
    a <- 0; a <- 1  # Two statements on one line
    a <- (x + 1) *  # One statement on two lines
        (y + 2)

    a <- "foo       
        bar"        # One string on two lines
}

Then in R:

source("example.R")
dput(attr(body(f), "srcref"))

Which prints this line/column information:

list(structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L), srcfile = <environment>, class = "srcref"), 
    structure(c(3L, 2L, 3L, 7L, 9L, 14L, 3L, 3L), srcfile = <environment>, class = "srcref"), 
    structure(c(3L, 10L, 3L, 15L, 17L, 22L, 3L, 3L), srcfile = <environment>, class = "srcref"), 
    structure(c(4L, 2L, 5L, 15L, 9L, 15L, 4L, 5L), srcfile = <environment>, class = "srcref"), 
    structure(c(7L, 2L, 8L, 6L, 9L, 20L, 7L, 8L), srcfile = <environment>, class = "srcref"))

As you can see (the last two numbers in each structure are the first and last line of the statement), the expressions a <- 0 and a <- 1 both map to line 3, while the statement that spans two lines maps to lines 4 to 5.
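To go from statements to lines, you could group the statements by the line they start on. This is only a sketch: it parses the source from a string with keep.source = TRUE (so the srcref information is available even in a non-interactive session) and uses a shortened version of the sample function. The variable names are illustrative.

```r
# Shortened version of the sample function, parsed from text so that
# line numbers are relative to the start of this string.
src <- "f <- function(x, y = 3)
{
    a <- 0; a <- 1
    a <- (x + 1) *
        (y + 2)
}"
eval(parse(text = src, keep.source = TRUE))   # defines f

stmts <- as.list(body(f))[-1]                 # drop the leading `{`
refs  <- attr(body(f), "srcref")[-1]          # drop the srcref of `{`

# Element 7 of a srcref is the first line of the statement.
first_line <- vapply(refs, function(r) as.integer(r)[7], integer(1))

# Indices of the statements that start on each source line.
by_line <- split(seq_along(stmts), first_line)
by_line
```

Statements that share an entry in by_line could then be timed together to get per-line rather than per-statement figures.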

Good luck!