Search code examples
rdplyrtidyverse

Sum column of rows below which meet a certain criteria


I have a dataframe which looks something like this

df <- 
  data.frame(col1 = rep(1:3,2),
             col2 = seq(2,12,2))

  col1 col2
1    1    2
2    2    4
3    3    6
4    1    8
5    2   10
6    3   12

I want to create col3 which gives sums below as follows:

  • for each col1 == 1 I want to sum col2 of every col1==2 below that iteration.
  • for each col1 == 2 I want to sum col2 of every col1==3 below that iteration;
  • and so on (actual data has a range of 1:7 in col1.

So my desired output would look like:

  col1 col2 col3
1    1    2   14
2    2    4   18
3    3    6   NA
4    1    8   10
5    2   10   12
6    3   12   NA

I have made attempts to do this using for loops but I think there may be a more simple way of approaching.


Solution

  • dplyr approach with iteration

    To keep this readable, I think you have to iterate here as the condition by which you're calculating the sum differs for every row. But you can do it only once by defining the target value for each row inside your loop (or sapply() in this case) and then looking for values in the corresponding rows which match. I use dplyr::mutate() here as you've tagged :

    library(dplyr)
    df |>
        mutate(
            col3 = sapply(row_number(), \(x) {
                target <- if_else(col1 %in% c(1, 2), col1 + 1, NA)[x]
                sum(col2[x:n()][col1[x:n()] == target])
            })
        )
    #   col1 col2 col3
    # 1    1    2   14
    # 2    2    4   18
    # 3    3    6   NA
    # 4    1    8   10
    # 5    2   10   12
    # 6    3   12   NA
    

    base R approach without iteration

    Alternatively, we can use outer() here to avoid iteration. I don't really like this approach - I just wrote it and I already think it's barely comprehensible. On the other hand, it might be quicker if your data is big. Having said that, perhaps not, as it will create a large intermediate matrix of size nrow(df)^2, which may be more expensive than sapply(). Benchmarking is key if performance is a concern.

    res <- with(
        df,
        outer(
            seq_along(col1),
            seq_along(col1),
            \(i, j)
            j > i & col1[j] == col1 + 1
        ) %*% col2
    )
    df |>
        transform(
            col3 = ifelse(
                col1 %in% c(1, 2), res, NA
            )
        )
    
    #   col1 col2 col3
    # 1    1    2   14
    # 2    2    4   18
    # 3    3    6   NA
    # 4    1    8   10
    # 5    2   10   12
    # 6    3   12   NA