Search code examples
rdataframedplyrsummary

Calculate conditioned summary statistics using just one code in R


I have a data frame like df:

df <- data.frame(yr = c("2008", "2008", "2008", "2008", "2008", "2009", "2009"),
                 FM = c(1, 1, 1, 1, 0, 1, 0),
                 t1 = c(0, 1, 0, 1, 0, 1, 0),
                 t2 = c(0, 0, 1, 1, 0, 0, 1))

I would like to get 3 different statistisc:

  1. How many individuals classify as FM=1 or FM=0 by year
  2. Conditioned on FM=1, how many of them got t1=1 vs t1=0
  3. Conditioned on FM=1 and t1=1, how many of them got t2=1 vs t2=0

I do not know how to get these statistics in just one table, and perform these sequential calculations in just one code (I need for code clarity and because my dataset is enormous).

The independent steps would be as follows

df %>% 
  group_by(yr, FM) %>% 
  summarise(count_FM = n()) %>% 
  mutate(perc_FM = count_FM/sum(count_FM))


df %>% 
  filter(FM=="1") %>% 
  group_by(yr, t1) %>% 
  summarise(count_t1 = n()) %>% 
  mutate(perc_t1 = count_t1/sum(count_t1))


df %>% 
  filter(FM=="1" & t1=="1") %>% 
  group_by(yr, t2) %>% 
  summarise(count_t2 = n()) %>% 
  mutate(perc_t2 = count_t2/sum(count_t2))

Any idea?


Solution

  • The problem is that your summaries are based on different grouping levels, so to get an output suitable for printing you will need to organize your data into a nested table. R's data frames and tibbles do not lend themselves to such a structure, though it is possible with a bit of work and converting all the columns to character columns:

    library(tidyverse)
    
    nest(df, data = -(yr:FM)) %>%
      mutate(count_FM = unlist(map(data, nrow))) %>%
      group_by(yr) %>%
      mutate(perc_FM = count_FM / sum(count_FM)) %>%
      summarize(FM, count_FM, perc_FM, data, .groups = "keep",
                count_t1 = map(data, ~ .x %>% 
                                 count(t1, name = "count_t1") %>%
                                 mutate(perc_t1 = count_t1 / sum(count_t1)))) %>%
      unnest(count_t1) %>%
      summarize(FM, count_FM, perc_FM, t1, count_t1, perc_t1, .groups = "keep",
                count_t2 = map(data, ~ .x %>% 
                                 count(t2, name = "count_t2") %>%
                                 mutate(perc_t2 = count_t2 / sum(count_t2)))) %>%
      unnest(count_t2) %>%
      mutate(across(t2:perc_t2, ~ifelse(t1 == 0, "", as.character(.x)))) %>%
      group_by(yr, FM, t1) %>%
      mutate(across(count_t1:perc_t1, ~c(first(.x), rep("", n() - 1))),
             t1 = c(first(t1), rep("", n() - 1))) %>%
      group_by(yr, FM) %>%
      mutate(across(count_FM:perc_FM, ~c(first(.x), rep("", n() - 1))),
             FM = c(first(FM), rep("", n() - 1)),
             yr = c(first(yr), rep("", n() - 1))) %>%
      as.data.frame()
    #>     yr FM count_FM perc_FM t1 count_t1 perc_t1 t2 count_t2 perc_t2
    #> 1 2008  1        4     0.8  0        2     0.5                    
    #> 2                                                                 
    #> 3                           1        2     0.5  0        2     0.5
    #> 4                                               1        2     0.5
    #> 5 2008  0        1     0.2  0        1       1                    
    #> 6 2009  1        1     0.5  1        1       1  0        1       1
    #> 7 2009  0        1     0.5  0        1       1
    

    Created on 2023-02-21 with reprex v2.0.2