I am trying to calculate the sum for different columns in a data frame, based on different criteria.
Year <- c(2001, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004)
COMP1 <- c(NA, 1, 2, 6, 9, NA, 2, 1, NA)
COMP2 <- c(2, 3, 3, 3, 6, 4, > 1, 0, 1)
COMP3 <- c(NA, 1, 2, 3, 4, 0, 0, 1, 0)
COMP4 <- c(25, 29, 16, 17, NA, 20, NA, 21, 12)
DF <- data.frame(Year, COMP1, COMP2, COMP3, COMP4)
So this is want I am trying to do:
For every year and every company (COMP), I want to calculate the sum of the values, BUT ONLY if the individual value in the cell is above the mean of values within that year).
I have achieved to calculate the sum and the mean, but I dont know how to combine them to create a sum with the condition mentioned.
These are the formulas I've tried so far (please note that my original sample has 6121 columns of different companies):
SUM <- aggregate(DF[, 2:6121], list(DF$Year), sum)
MEAN <- aggregate(DF[, 2:6121], list(DF$Year), mean)
library(tidyverse)
df %>%
summarize(across(contains("COMP"),
~ sum(.x[.x > mean(.x, na.rm = TRUE)], na.rm = TRUE)),
.by = Year)
# A tibble: 4 × 5
Year COMP1 COMP2 COMP3 COMP4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 2 6 2 54
2 2002 9 6 4 0
3 2003 0 4 0 0
4 2004 0 1 1 21
Include values equals to and above the groups mean:
df %>%
summarize(across(1:4,
~ sum(.x[.x >= mean(.x, na.rm = TRUE)], na.rm = TRUE)),
.by = Year)
# A tibble: 4 × 5
Year COMP1 COMP2 COMP3 COMP4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 2 6 2 54
2 2002 9 6 4 17
3 2003 2 4 0 20
4 2004 1 1 1 21