Search code examples
rdplyrdata-management

Calculate the descriptive statistics in a loop across several columns


I'm generating random numbers for one variable and repeating the process several times. I want to calculate the mean of value for each group (group1, group2, group3) in each iteration of the loop. I want to store the result so I afterward can estimate the mean share for each group across all iterations of the loop.

require(tidyverse)

set.seed(21)
group1 <- sample(c("A1", "A2", "B1", "B2", "C1", "C2"), 1000, TRUE)
group2 <- sample(c("G1", "G2", "G4"), 1000, TRUE)
group3 <- sample(c("D1", "D2"), 1000, TRUE)
prob <- runif(1000, 0, 1)
df <- as.data.frame(cbind(group1, group2, group3, prob))

df$prob <- as.numeric(df$prob)

for (i in 1:15) {
  
  df <- df %>%
    mutate(value = rbinom(nrow(df), 1, prob = prob))

  # [INSERT CALCULATION OF MEAN FOR EACH GROUP VARIABLE AND STORE IT]
}

# [INSERT CALCULATION OF MEAN ACROSS ALL ITERATIONS]

My main issue is how to estimate the mean of value across several variables in an efficient way and store the result in a smooth way.

Thanks in advance.

To clarify:

I want the end result to look something like this:

col "group1_A1" "Group1_A2" "group1_B1" "group1_B2" "group1_C1" "group1_C2" "group2_G1" "group2_G2" "group2_G4" "group3_D1" "group3_D2"
x1  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x2  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x3  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x4  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x4  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x5  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x6  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x7  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x8  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x9  "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"    
x10 "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"     "x_bar"

Where the three groups subgroups mean is replacing "x_bar and each row is one iteration calculated means. An easy solution would be to use dplyr's group_by but I want to find a solution so I go through all the three grouping variables.

To put this in a context: imagine that the variable prob is the probability of dying. group1 is a variable indicating 6 age groups, group2 is indicating on socioeconomic status, and group3 is gender. I then want to see who is most likely to die. To do so I randomly generate a Bernoulli variable that is dependent on the probability of prob. To remove some stochastic I repeat this process 15 times and then want to see how big a share of each sociodemographic group that died (received a value of 1 on the variable value. For each iteration, I want to calculate the group belonging of those who died (so how many males died, how many old people died). I'm sorry for not coming up with a more joyful example.


Solution

  • Here is an approach using some tidyverse functions.

    library(dplyr)
    library(tidyr)
    df2 <- df %>% 
      pivot_longer(starts_with("group") ) %>%
      mutate(group = paste0(name, "_", value)) %>%
      select(group)
    
    for (i in 1:15) {
      
      df2 <- df %>%
        mutate(value = rbinom(nrow(df), 1, prob = prob)) %>%
        pivot_longer(starts_with("group"), values_to = "val" ) %>%
        mutate(group = paste0(name, "_", val)) %>%
        group_by(group) %>%
        summarise(mean = mean(value, na.rm = TRUE)) %>%
        rename_with(.cols = mean, .fn = ~ paste0("mean", i)) %>%
        inner_join(df2, by = c("group" = "group"))
    }
    df2 %>%
      pivot_longer(starts_with("mean"), names_to = "trial", names_prefix = "mean") %>%
      distinct() %>%
      pivot_wider(id_cols = mean, names_from = "group", values_from = "value")
    # A tibble: 15 x 12
       trial group1_A1 group1_A2 group1_B1 group1_B2 group1_C1 group1_C2 group2_G1 group2_G2 group2_G4 group3_D1 group3_D2
       <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
     1 15        0.519     0.514     0.516     0.519     0.551     0.533     0.529     0.542     0.507     0.518     0.533
     2 14        0.481     0.486     0.503     0.536     0.487     0.550     0.526     0.493     0.507     0.495     0.520
     3 13        0.506     0.541     0.477     0.470     0.572     0.556     0.575     0.499     0.496     0.486     0.552
     4 12        0.519     0.534     0.497     0.557     0.604     0.509     0.549     0.522     0.548     0.548     0.531
     5 11        0.5       0.568     0.458     0.481     0.497     0.562     0.542     0.496     0.496     0.467     0.548
     6 10        0.525     0.466     0.503     0.503     0.535     0.580     0.581     0.490     0.496     0.488     0.548
     7 9         0.494     0.547     0.490     0.448     0.610     0.598     0.578     0.504     0.519     0.501     0.560
     8 8         0.538     0.554     0.471     0.530     0.599     0.538     0.545     0.516     0.559     0.565     0.518
     9 7         0.525     0.588     0.548     0.475     0.535     0.568     0.601     0.499     0.522     0.507     0.565
    10 6         0.513     0.527     0.529     0.546     0.561     0.503     0.562     0.513     0.522     0.510     0.550
    11 5         0.462     0.493     0.503     0.508     0.513     0.568     0.513     0.493     0.522     0.507     0.510
    12 4         0.506     0.5       0.452     0.481     0.599     0.556     0.545     0.516     0.496     0.520     0.516
    13 3         0.525     0.466     0.503     0.525     0.567     0.556     0.529     0.550     0.499     0.497     0.552
    14 2         0.462     0.554     0.471     0.514     0.519     0.574     0.536     0.516     0.499     0.520     0.512
    15 1         0.506     0.541     0.497     0.470     0.519     0.544     0.510     0.519     0.507     0.510     0.514
    

    This gets you your first part - a data.frame where each row is a trial with means per group.

    Your second part is as follows:

    df2 %>%
      pivot_longer(starts_with("mean"), names_to = "trial", names_prefix = "mean") %>%
      distinct() %>%
      group_by(group) %>%
      summarize(mean = mean(value))
    # A tibble: 11 x 2
       group      mean
       <chr>     <dbl>
     1 group1_A1 0.505
     2 group1_A2 0.525
     3 group1_B1 0.495
     4 group1_B2 0.504
     5 group1_C1 0.551
     6 group1_C2 0.553
     7 group2_G1 0.548
     8 group2_G2 0.511
     9 group2_G4 0.513
    10 group3_D1 0.509
    11 group3_D2 0.535