Search code examples
rmapreducenestedcode-readability

How to structure R code for inherently nested problems to be easily readible?


There are problems that inherently require several layers of nesting to be solved. In a current project, I frequently find myself using three nested applys in order to do something with the elements contained in the deepest layer of a nested list structure.

R's list handling and apply-family allows for quite concise code for this type of problems, but writing it still gives me headaches and I'm pretty sure that anyone else reading it will have to take several minutes to understand what I am doing. This is although I'm basically doing nothing complicated - if it weren't for the multiple layers of lists to be traversed.

Below I provide a snippet of code that I wrote that shall serve as an example. I would consider it concise, yet hard to read.

Context: I am writing a simulation of surface electromyographic data, i.e. changes in the electrical potential that can be measured on the human skin and which is invoked by muscular activity. For this, I consider several muscles (first layer of lists) which each consist of a number of so-called motor units (second layer of lists) which each are related to a set of electrodes (third layer of lists) placed on the skin. In the example code below, we have the objects .firing.contribs, which contains the information on how strong the action of a motor unit influences the potential at a specific electrode, and .firing.instants which contains the information on the instants in time at which these motor units are fired. The given function then calculates the time course of the potential at each electrode.

Here's the question: What are possible options to render this type of code easily understable for the reader? Particularly: How can I make clearer what I am actually doing, i.e. performing some computation on each (MU, electrode) pair and then summing up all potential contributions to each electrode? I feel this is currently not very obvious in the code.

Notes:

  • I know of plyr. So far, I have not been able to see how I can use this to render my code more readable, though.
  • I can't convert my structure of nested lists to a multidimensional array since the number of elements is not the same for each list element.
  • I searched for R implementations of the MapReduce algorithm, since this seems to be applicable to the problem I give as an example (although I'm not a MapReduce expert, so correct me if I'm wrong...). I only found packages for Big Data processing, which is not what I'm interested in. Anyway, my question is not about this particular (Map-Reducible) case but rather about a general programming pattern for inherently nested problems.
  • I am not interested in performance for the moment, just readability.

Here's the example code.

sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {

    calc.MU.contribs <- function(.MU.firing.contribs, .MU.firing.instants)
        lapply(.MU.firing.contribs, calc.MU.electrode.contrib,
               .MU.firing.instants)

    calc.muscle.contribs <- function(.MU.firing.contribs, .MU.firing.instants) {

        MU.contribs <- mapply(calc.MU.contribs, .MU.firing.contribs,
                              .MU.firing.instants, SIMPLIFY = FALSE)

        muscle.contribs <- reduce.contribs(MU.contribs)
    }

    muscle.contribs <- mapply(calc.muscle.contribs, .firing.contribs,
                              .firing.instants, SIMPLIFY = FALSE)

    surface.potentials <- reduce.contribs(muscle.contribs)
}

## Takes a list (one element per object) of lists (one element per electrode)
## that contain the time course of the contributions of that object to the
## surface potential at that electrode (as numerical vectors).
## Returns a list (one element per electrode) containing the time course of the
## contributions of this list of objects to the surface potential at all
## electrodes (as numerical vectors again).
reduce.contribs <- function(obj.list) {

    contribs.by.electrode <- lapply(seq_along(obj.list[[1]]), function(i)
                                    sapply(obj.list, `[[`, i))

    contribs <- lapply(contribs.by.electrode, rowSums)
}


calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {

    ## This will in reality be more complicated since then .MU.firing.contrib
    ## will have a different (more complicated) structure.
    .MU.firing.contrib * .MU.firing.instants
}

firing.contribs <- list(list(list(1,2),list(3,4)),
                         list(list(5,6),list(7,8),list(9,10)))

firing.instants <- list(list(c(0,0,1,0), c(0,1,0,0)),
                         list(c(0,0,0,0), c(1,0,1,0), c(0,1,1,0)))

surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)

Solution

  • As suggested by user @alexis_laz, it is a good option to actually not use a nested list structure for representing data, but rather use a (flat, 2D, non-nested) data.frame. This greatly simplified coding and increased code conciseness and readability for the above example and seems to be promising for other cases, too.

    It completely eliminates the need for nested apply-foo and complicated list traversing. Moreover, it allows for the usage of a lot of built-in R functionality that works on data frames.

    I think there are two main reasons why I did not consider this solution before:

    • It does not feel like the natural representation of my data, since the data actually are nested. By fitting the data into a data.frame, we're losing direct information on this nesting. It can of course be reconstructed, though.
    • It introduces redundancy, as I will have a lot of equivalent entries in the categorical columns. Of course this doesn't matter by any measure since it's just indices that don't consume a lot of memory or something, but it still feels kind of bad.

    Here is the rewritten example. Comments are most welcome.

    sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
    
        firing.info <- merge(.firing.contribs, .firing.instants)
    
        firing.info$contribs.time <- mapply(calc.MU.electrode.contrib,
                                            firing.info$contrib,
                                            firing.info$instants,
                                            SIMPLIFY = FALSE)
    
        surface.potentials <- by(I(firing.info$contribs.time),
                                 factor(firing.info$electrode),
                                 function(list) colSums(do.call(rbind, list)))
    
        surface.potentials
    }
    
    
    calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
    
        ## This will in reality be more complicated since then .MU.firing.contrib
        ## will have a different (more complicated) structure.
        .MU.firing.contrib * .MU.firing.instants
    }
    
    
    firing.instants <- data.frame(muscle = c(1,1,2,2,2),
                                  MU = c(1,2,1,2,3),
                                  instants = I(list(c(F,F,T,F),c(F,T,F,F),
                                      c(F,F,F,F),c(T,F,T,F),c(F,T,T,F))))
    
    firing.contribs <- data.frame(muscle = c(1,1,1,1,2,2,2,2,2,2),
                                  MU = c(1,1,2,2,1,1,2,2,3,3),
                                  electrode = c(1,2,1,2,1,2,1,2,1,2),
                                  contrib = c(1,2,3,4,5,6,7,8,9,10))
    
    surface.potentials <- sum.MU.firing.contributions(firing.contribs,
                                                      firing.instants)