Tags: r, conditional-statements, standard-deviation

Calculate standard deviation of a subgroup of data


I am trying to calculate the standard deviation for a subgroup of my data set. For each year, the standard deviation should be calculated only for those values that are above the mean of all values for that year. The values above the mean thus form a subgroup, "SUBGROUP", and I want the standard deviation of that group.

This is my trial data:

Year <- c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003, 2004, 2004, 2004, 2004) 
COMP1 <- c(NA, 1, 2, 6, 9, NA, 2, 1, NA, 2, 9, 6, NA, 1, 8, 5) 
COMP2 <- c(2, 3, 3, 3, 6, 4, 1, 0, 1, 3, 6, 1, NA, 1, 8, 8)  
COMP3 <- c(NA, 1, 2, 3, 4, 0, 0, 1, 0, 4, 2, 2, 1, NA, 1, 1)  
COMP4 <- c(25, 29, 16, 17, NA, 20, NA, 21, 12, 17, 31, 32, 21, 1, 2, 1)

DF <- data.frame(Year, COMP1, COMP2, COMP3, COMP4)

And this is the code I have tried so far (it contains an error that I cannot find):

SUBGROUP <- DF  %>%
  summarize(across(c(1:4), 
                       ~ sd(.x[which(.x = sum(.x[which(.x > mean(.x, na.rm = TRUE))], 
na.rm = TRUE))], na.rm = TRUE, .by = Year)

Does anybody have an idea how to fix my formula?


Solution

  • You could do:

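    library(tidyverse)  # `.by` in summarize() requires dplyr >= 1.1.0

    DF %>%
      pivot_longer(-Year) %>%      # long format: Year, name, value
      filter(!is.na(value)) %>%    # drop NAs so mean() and the comparison are NA-free
      # per Year and column: SD of the values above that group's mean
      summarize(SD = sd(value[value > mean(value)]), .by = c(Year, name)) %>%
      pivot_wider(values_from = SD)  # back to one column per COMP variable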
    

    Which gives:

    # A tibble: 4 x 5
       Year COMP1 COMP2 COMP3  COMP4
      <dbl> <dbl> <dbl> <dbl>  <dbl>
    1  2001 NA     0       NA  2.83 
    2  2002 NA     1.41    NA NA    
    3  2003  2.12  2.12    NA  0.707
    4  2004  2.12  0       NA NA  
    

    Alternatively, you could do:

    DF %>%
      # which() drops the NA comparisons, so sd() only sees non-missing values above the mean
      summarize(across(COMP1:COMP4,
                       ~ sd(.x[which(.x > mean(.x, na.rm = TRUE))])),
                .by = Year)
    

    This gives the same result.
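
As a cross-check, the same calculation can be sketched in base R for a single year and column. This is only an illustration and not part of the original answer; the helper name `sd_above_mean` is hypothetical. It also suggests why several cells in the result are `NA`: `sd()` returns `NA` when only one value lies above the group mean.

```r
# Hypothetical helper (not from the original answer): standard deviation of
# the values that lie strictly above the mean of x, ignoring NAs.
sd_above_mean <- function(x) {
  x <- x[!is.na(x)]
  sd(x[x > mean(x)])
}

# Same trial data as in the question (only two columns reproduced here)
Year  <- c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002,
           2003, 2003, 2003, 2003, 2004, 2004, 2004, 2004)
COMP1 <- c(NA, 1, 2, 6, 9, NA, 2, 1, NA, 2, 9, 6, NA, 1, 8, 5)
COMP4 <- c(25, 29, 16, 17, NA, 20, NA, 21, 12, 17, 31, 32, 21, 1, 2, 1)

# 2001, COMP4: the mean is 21.75, so the subgroup is c(25, 29)
sd_above_mean(COMP4[Year == 2001])

# 2001, COMP1: only the value 6 lies above the mean of 3, so sd() returns NA
sd_above_mean(COMP1[Year == 2001])

# All years at once for one column
res <- tapply(COMP4, Year, sd_above_mean)
res
```

The values in `res` should match the COMP4 column of the table above.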