Search code examples
rdplyrstatisticssummaryvariance

Writing a pooled variance/standard deviation function that works with dplyr::group_by() R tidyverse


Coding Objective:

Make a tidy compatible function that I can pipe a df into to calculate pooled standard deviation. It needs to respect group_by() groupings as well for the summarise() output.

Underlying Objective:

Report pooled standard deviation statistics for each treatment x timepoint x condition sub-grouping when there is unequal sampling. It is assumed that all the subjects come from the same population.

I think I want to use this pooled standard deviation formula, but am uncertain:

enter image description here

The Data:

For each sub-condition, I calculated the arithmetic lognormal variance for each subject as well as the number of observations.

enter image description here

df <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), intervention = c("trt", 
"trt", "trt", "trt", "ctrl", "ctrl", "ctrl", "ctrl", "trt", "trt", 
"trt", "trt", "trt", "trt", "trt", "trt", "trt", "trt", "trt", 
"trt", "ctrl", "ctrl", "ctrl", "ctrl", "ctrl", "ctrl", "ctrl", 
"ctrl", "ctrl", "ctrl", "ctrl", "ctrl"), timepoint = c("t1", 
"t1", "t2", "t2", "t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2", 
"t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2", "t1", "t1", "t2", 
"t2", "t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"), lognormal_var = c(0.16, 
0.218, 0.15, 0.368, 0.262, 0.283, 0.21, 0.218, 0.167, 0.211, 
0.205, 0.278, 0.261, 0.301, 0.173, 0.313, 0.187, 0.27, 0.168, 
0.262, 0.162, 0.272, 0.202, 0.195, 0.176, 0.181, 0.152, 0.257, 
0.144, 0.163, 0.284, 0.269), n = c(1223, 1011, 830, 597, 1048, 
1332, 672, 380, 623, 400, 1006, 820, 967, 752, 799, 884, 945, 
941, 675, 395, 815, 787, 823, 753, 861, 727, 1508, 1255, 762, 
791, 1382, 939), condition = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 
2)), row.names = c(NA, -32L), class = c("tbl_df", "tbl", "data.frame"
))

I have 8 subjects, 4 in each treatment condition. Each subject is repeatedly sampled at 2 timepoints, and within each timepoint they are assessed in 2 different conditions. Within each timepoint x condition, subjects are variably sampled between a couple hundred to a couple thousand times.

What I tried:

Latest Attempt Edit: 3rd attempt

This doesn't respect the groupings from the group_by(). The column pooled_lognormal_sd is correct per the solution provided by u/Ricardo Semião e Castro .

pooled_var <- function(.data){
 var <- .data$lognormal_var
 sqrt(sum(var * ( .data$n - 1 )) / (sum(.data$n) - nrow(.data)))
}

df |>
  group_by(intervention, timepoint, condition) %>%
  summarise(pooled_var(.),
            pooled_lognormal_sd = sqrt(sum(lognormal_var * ( n - 1 )) / (sum(n) - n())))

enter image description here

Log-normal variance function:

lognormal_var <- function(x) {
  exp(2*mean(log(x)) + var(log(x)))*(exp(var(log(x))) - 1)
}

1st attempt:

pooled_var <- function(df,
                       var = "lognormal_var"){
 var <- if(var %in% names(df)) df$var else print0("Variable name not found in df")
 sum(var * ( df$n - 1 )) / (sum(df$n) - nrow(df))
}
library(dplyr)
df |>
  group_by(intervention, timepoint, condition) %>%
  pooled_var(var = "lognormal_var") |>
  sqrt()

Which yields the error: "Warning: Unknown or uninitialised column: var.2 0". I expect 8 rows to be returned for each combination of timepoint, treatment, and condition.


Solution

  • You created the groups but then didn't used a dplyr verb to apply your function to every group, that's why you only got one value. You want to use summarise.

    Also, as you're using a tidyverse approach, you don't need to make a function that takes the column as a string. dplyr is built around calling column directly.

    pooled_var <- function(var, n, k) sqrt(sum(var * ( n - 1 )) / (sum(n) - k))
    
    df %>%
      group_by(intervention, timepoint, condition) %>%
      summarise(sigma = pooled_var(lognormal_var, n, n()))
    

    Now, lognormal_var isn't the whole df column, but respects the groups. The same is true for n and the number of lines in each group (which I imagine is what "k" should be), n().

    Result:

    # A tibble: 8 × 4
    # Groups:   intervention, timepoint [4]
      intervention timepoint condition sigma
      <chr>        <chr>         <dbl> <dbl>
    1 ctrl         t1                1 0.438
    2 ctrl         t1                2 0.484
    3 ctrl         t2                1 0.460
    4 ctrl         t2                2 0.492
    5 trt          t1                1 0.440
    6 trt          t1                2 0.503
    7 trt          t2                1 0.419
    8 trt          t2                2 0.554
    

    Obs: I'm not 100% sure if this is respecting your formula, you should check if the result is reasonable.