Search code examples
rdplyrstatisticsstandard-deviation

How to calculate the average of multiple standard deviations in R?


I am trying to figure out how to calculate the standard deviation of a dataset when I have a couple of standard deviations. Let's just look at this MWE:

set.seed(1234)
dummy_data <- data.frame(
  "col_1" = sample(1:7, size = 10, replace = TRUE),
  "col_2" = sample(1:7, size = 10, replace = TRUE),
  "col_3" = sample(1:7, size = 10, replace = TRUE),  
  "col_4" = sample(1:7, size = 10, replace = TRUE)
)

Now since I know all the data points I can calculate the total standard deviation as follows:

> sd(as.matrix(dummy_data))
[1] 1.727604

But the real data that I have at hand is the following:

> dplyr::summarise_all(dummy_data, sd)
     col_1    col_2   col_3    col_4
1 1.837873 1.873796 1.37032 1.888562

If I follow the usual method of calculating the average of multiple standard deviations with similar sample sizes, I would apply the following:

sds <- dplyr::summarise_all(dummy_data, sd)
vars <- sds^2
mean_sd <- sqrt(sum(vars) / (length(vars) - 1))

> mean_sd
[1] 2.027588

which is not the same! Now I have tried without the minus one:

> sqrt(sum(vars) / (length(vars)))
[1] 1.755942

which does not solve the problem. I have tried defining an own standard deviation function like this:

own_sd <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}

to eliminate the x - 1 in the dplyr::summarise_all() step, and then average according to the step above:

> sqrt(sum(dplyr::summarise_all(dummy_data, own_sd)^2) / 3)
[1] 1.923538
> sqrt(sum(dplyr::summarise_all(dummy_data, own_sd)^2) / 4)
[1] 1.665833

But all seem to give a different result than the sd(as.matrix()) method. What is going wrong here?


Solution

  • You can't calculate a global SD from only knowing group SDs. For example:

    x1 = 1:5
    x2 = 11:15
    x3 = 101:105
    
    ## all the SDs are equal
    (sd1 = sd(x1))
    #[1] 1.581139
    (sd2 = sd(x2))
    #[1] 1.581139
    (sd3 = sd(x3))
    #[1] 1.581139
    
    ## however, combining the groups in pairs give very different results
    sd(c(x1, x2))
    # [1] 5.477226
    
    sd(c(x1, x3))
    # [1] 52.72571
    

    This demonstrates that even if the sample sizes are identical, knowing the standard deviation of two groups does not help you calculate the standard deviation of those groups combined.

    As per Merijn van Tilborg's comment, if you know the group sizes and the group means, the calculation is possible as shown here.