
How to compute the mean of counts of non-missing values?


I would like to first compute, for each group, the number of non-missing values in a specific column of a data frame, and then compute the mean of those counts. In other words, I want the average group count of non-missing values (a single value).

I managed to compute the group counts of non-missing values, but not their average (a single value). The code below works except for the last line, which I commented out because it gives the wrong output.

library(dplyr)

data <- tibble(hosp     = c("1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3"),
             from     = c("A", "A", "B", "B", "C", "C", "C", "A", "A", "B", "B", "D", "D", "D", "B", "E", "E", "E", "E"), 
             to       = c("C", "B", "C", "A", "B", "A", "B", "D", "B", "A", "D", "A", "B", "B", "E", "B", "B", "B", "B"),
             hosp_ind = c("" , "3", "" , "" , "2", "2", "3", "" , "3", "" , "" , "1", "1", "3", "" , "1", "1", "2", "2"),
             to_ind   = c("" , "E", "" , "" , "D", "D", "E", "" , "E", "" , "" , "C", "C", "E", "" , "A", "C", "A", "D")) 

summary <- data %>%
  group_by(hosp, from, to) %>%
  summarise(N_iv = sum(!is.na(to_ind)))
  # %>% summarise(mean(N_iv))

Solution

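  • I guess this is what you are trying to do. By default, summarise only drops the last grouping level, so the result is still grouped by hosp and from. You have to ungroup before the final summarise: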

    
    (
      data
      %>% group_by(hosp, from, to)
      %>% mutate(
        hosp_ind = na_if(hosp_ind, ""), 
        to_ind = na_if(to_ind, "") )
      %>% summarise(
        N_iv = sum(!is.na(to_ind)))
      %>% ungroup
      %>% summarise(mean(N_iv))
    )
    

    Output:

    # A tibble: 1 x 1
      `mean(N_iv)`
             <dbl>
    1        0.857
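
    As a hedged alternative to the explicit ungroup step, and assuming dplyr 1.0.0 or later, the .groups argument of summarise can drop the grouping in the same call. The sketch below uses a small made-up tibble (toy, g1, g2 and v are hypothetical names, not from the question) and also shows what happens when the grouping is kept:

    ```r
    library(dplyr)

    # Hypothetical toy data: two grouping columns and one column with NAs.
    toy <- tibble(
      g1 = c("a", "a", "a", "b"),
      g2 = c("x", "x", "y", "x"),
      v  = c("1", NA, "2", NA)
    )

    # Grouping kept (same as the default here): the second summarise
    # still runs once per g1, so it returns one row per g1.
    per_g1 <- toy %>%
      group_by(g1, g2) %>%
      summarise(n_ok = sum(!is.na(v)), .groups = "drop_last") %>%
      summarise(mean_n = mean(n_ok))

    # Grouping dropped: the second summarise returns a single value.
    overall <- toy %>%
      group_by(g1, g2) %>%
      summarise(n_ok = sum(!is.na(v)), .groups = "drop") %>%
      summarise(mean_n = mean(n_ok))
    ```

    Here per_g1 has two rows (one per value of g1), while overall has a single row holding the mean of the three group counts.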
    

    Note that the empty string "" is not the same thing as NA: is.na("") is FALSE, so without this conversion every row would be counted as non-missing. That is why I added these lines:

    %>% mutate(
        hosp_ind = na_if(hosp_ind, ""), 
        to_ind = na_if(to_ind, "") )
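
    A minimal sketch of the difference, using a made-up vector x (hypothetical, not taken from the question's data):

    ```r
    library(dplyr)

    x <- c("", "E", "", "D")

    # "" is a valid (empty) string, so is.na() is FALSE for every element
    # and all four values are counted.
    n_raw <- sum(!is.na(x))

    # na_if() replaces "" with NA, so only the two non-empty values remain.
    x_na <- na_if(x, "")
    n_clean <- sum(!is.na(x_na))
    ```

    With this vector, n_raw is 4 and n_clean is 2.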
    

    Another way is to pull the column N_iv and compute its mean directly:

    (
      data
      %>% group_by(hosp, from, to)
      %>% mutate(
        hosp_ind = na_if(hosp_ind, ""), 
        to_ind = na_if(to_ind, "") )
      %>% summarise(
        N_iv = sum(!is.na(to_ind)))
      %>% pull(N_iv)
      %>% mean
    )
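
    The pull version needs no ungroup, presumably because pull returns a plain vector and the grouping is not carried along. A small sketch with a hypothetical, already summarised tibble (counts, g1, g2 and n_ok are made-up names):

    ```r
    library(dplyr)

    # Hypothetical per-group counts, still grouped by g1.
    counts <- tibble(
      g1   = c("a", "a", "b"),
      g2   = c("x", "y", "x"),
      n_ok = c(1, 1, 0)
    ) %>%
      group_by(g1)

    # summarise respects the grouping: one mean per g1.
    grouped_mean <- counts %>% summarise(m = mean(n_ok))

    # pull extracts the column as a vector, so mean() sees all counts at once.
    vector_mean <- counts %>% pull(n_ok) %>% mean()
    ```

    Here grouped_mean has one row per value of g1, while vector_mean is a single number.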