I have a dataframe which looks something like this
df <-
data.frame(col1 = rep(1:3,2),
col2 = seq(2,12,2))
col1 col2
1 1 2
2 2 4
3 3 6
4 1 8
5 2 10
6 3 12
I want to create col3
which gives sums below as follows:
col1 == 1
I want to sum col2
of every col1==2
below that iteration.col1 == 2
I want to sum col2
of every col1==3
below that iteration;1:7
in col1
.So my desired output would look like:
col1 col2 col3
1 1 2 14
2 2 4 18
3 3 6 NA
4 1 8 10
5 2 10 12
6 3 12 NA
I have made attempts to do this using for
loops but I think there may be a more simple way of approaching.
dplyr
approach with iterationTo keep this readable, I think you have to iterate here as the condition by which you're calculating the sum differs for every row. But you can do it only once by defining the target value for each row inside your loop (or sapply()
in this case) and then looking for values in the corresponding rows which match. I use dplyr::mutate()
here as you've tagged dplyr:
library(dplyr)
df |>
mutate(
col3 = sapply(row_number(), \(x) {
target <- if_else(col1 %in% c(1, 2), col1 + 1, NA)[x]
sum(col2[x:n()][col1[x:n()] == target])
})
)
# col1 col2 col3
# 1 1 2 14
# 2 2 4 18
# 3 3 6 NA
# 4 1 8 10
# 5 2 10 12
# 6 3 12 NA
Alternatively, we can use outer()
here to avoid iteration. I don't really like this approach - I just wrote it and I already think it's barely comprehensible. On the other hand, it might be quicker if your data is big. Having said that, perhaps not, as it will create a large intermediate matrix of size nrow(df)^2
, which may be more expensive than sapply()
. Benchmarking is key if performance is a concern.
res <- with(
df,
outer(
seq_along(col1),
seq_along(col1),
\(i, j)
j > i & col1[j] == col1 + 1
) %*% col2
)
df |>
transform(
col3 = ifelse(
col1 %in% c(1, 2), res, NA
)
)
# col1 col2 col3
# 1 1 2 14
# 2 2 4 18
# 3 3 6 NA
# 4 1 8 10
# 5 2 10 12
# 6 3 12 NA