Search code examples
rnested-loopslapply

What can substitute a nested loop in R


I'd like to create the dataframe output from dataframe input in R by running multiple scenarios given two variables x and y. The column output is the sum of all the values in the column value where there xcol < x & ycol < y.

input =
xcol ycol value
1   5   4
2   6   9
3   7   8
4   9   7
5   14  8

And

output= 
x   y   results
2   5   0
2   10  4
2   15  35
...
6   5   0
6   10  27
6   15  35

My code is currently like this:

for (x in 2:6) {
  if (x%% 2){
    next
  }
  for (y in 5:15) {
    if (y %% 5){
      next
    }
    print(x)
    print(y)
    print(sum(input$value[!is.na(input$xcol) & !is.na(input$ycol) & !is.na(input$value) & 
              input$xcol < x &  input$ycol < y]))
  }
}

There should be a better way to replace this nested loop using lapply & sapply and create a dataframe I would think. I'd appreciate any helps.

Thanks


Solution

  • This seems more like a design-of-experiments, in a sense, where you are iterating over different possible values of x and y.

    xs <- 2:6
    ys <- 5:15
    eg <- expand.grid(x = xs, y = ys)
    head(eg)
    #   x y
    # 1 2 5
    # 2 3 5
    # 3 4 5
    # 4 5 5
    # 5 6 5
    # 6 2 6
    

    I think your %% filtering should be done outside/before this, so:

    xs <- xs[!xs %% 2]
    ys <- ys[!ys %% 5]
    eg <- expand.grid(x = xs, y = ys)
    head(eg)
    #   x  y
    # 1 2  5
    # 2 4  5
    # 3 6  5
    # 4 2 10
    # 5 4 10
    # 6 6 10
    

    From here, you can just iterate over the rows:

    eg$out <- sapply(seq_len(nrow(eg)), function(r) {
      sum(input$value[ complete.cases(input) & input$xcol < eg$x[r] & input$ycol < eg$y[r] ])
    })
    eg
    #   x  y out
    # 1 2  5   0
    # 2 4  5   0
    # 3 6  5   0
    # 4 2 10   4
    # 5 4 10  21
    # 6 6 10  28
    # 7 2 15   4
    # 8 4 15  21
    # 9 6 15  36
    

    I think your output variable is a little off, since "2,15" should only include input$value[1] (x < 2 is the limiting factor). (Other differences exist.)

    Regardless of your actual indexing logic, I suggest this methodology over a double-for or double-lapply implementation.

    NB:

    1. These commands are functionally equivalent with this input:

      complete.cases(input)                                         # 1
      complete.cases(input[c("xcol","ycol","value")])               # 2
      !is.na(input$xcol) & !is.na(input$xcol) & !is.na(input$value) # 3
      

      I used the first since "code golf", but if your actual input data.frame contains other columns, you may prefer the second to be more selective of which columns require non-NA values.

    2. expand.grid works great for this type of expansion. However, if you are looking at significantly larger datasets (including if your filtering is more complex than %% offers), than it can be a little expensive as it must create the entire data.frame in memory. Python's use of lazy iterators would be useful here, in which case you might prefer to use https://stackoverflow.com/a/36144255/3358272 (expanded function in a github gist with some docs: https://gist.github.com/r2evans/e5531cbab8cf421d14ed).