Search code examples
rconditional-statementssummean

How to create a conditional sum in R?


I am trying to calculate the sum for different columns in a data frame, based on different criteria.

This is my trial data: enter image description here

Year <- c(2001, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004)
COMP1 <- c(NA, 1, 2, 6, 9, NA, 2, 1, NA)  
COMP2 <- c(2, 3, 3, 3, 6, 4, > 1, 0, 1)   
COMP3 <- c(NA, 1, 2, 3, 4, 0, 0, 1, 0)   
COMP4 <- c(25, 29, 16, 17, NA, 20, NA, 21, 12)

DF <- data.frame(Year, COMP1, COMP2, COMP3, COMP4)

So this is want I am trying to do:

For every year and every company (COMP), I want to calculate the sum of the values, BUT ONLY if the individual value in the cell is above the mean of values within that year).

I have achieved to calculate the sum and the mean, but I dont know how to combine them to create a sum with the condition mentioned.

These are the formulas I've tried so far (please note that my original sample has 6121 columns of different companies):

SUM <- aggregate(DF[, 2:6121], list(DF$Year), sum)

MEAN <- aggregate(DF[, 2:6121], list(DF$Year), mean)

Solution

  • library(tidyverse)
    
    df %>%
      summarize(across(contains("COMP"), 
                       ~ sum(.x[.x > mean(.x, na.rm = TRUE)], na.rm = TRUE)), 
                .by = Year)
    
    # A tibble: 4 × 5
       Year COMP1 COMP2 COMP3 COMP4
      <dbl> <dbl> <dbl> <dbl> <dbl>
    1  2001     2     6     2    54
    2  2002     9     6     4     0
    3  2003     0     4     0     0
    4  2004     0     1     1    21
    

    Include values equals to and above the groups mean:

    df %>%
      summarize(across(1:4, 
                       ~ sum(.x[.x >= mean(.x, na.rm = TRUE)], na.rm = TRUE)), 
                .by = Year)
    
    # A tibble: 4 × 5
       Year COMP1 COMP2 COMP3 COMP4
      <dbl> <dbl> <dbl> <dbl> <dbl>
    1  2001     2     6     2    54
    2  2002     9     6     4    17
    3  2003     2     4     0    20
    4  2004     1     1     1    21