Search code examples
rdplyrdata-wrangling

Group by and sum/count of columns issues


I am trying to to group_by program names and count the total column amounts per program. There are NA's in the mix.

Taking a data frame like this: dataframe

and getting something like in return returned data

The following code only counts all non NA observations. It doesn't actually add up the numbers. Do I need to do some sort of ifelse here? I also wonder if the !is.na causes it to count all non NA observations, however, if I remove it, I get all NA's as my totals.

df %>%  
  group_by(ProgramName) %>%
  summarise(ED = sum(!is.na(HighSchool)), EMP = sum(!is.na(Employment)))

alternatively, is there is a way to group by program name and count the observations ONLY if they had a 1 in either column, not to tally up the total? That is closer to what I want anyway. Any support would be appreciated.


Solution

  • To answer both questions:

    library(dplyr)
    
    df <- structure(list(ProgramName = c("Program A", "Program A", "Program A", 
    "Program A", "Program B", "Program B", "Program B", "Program B", 
    "Program C", "Program C", "Program C", "Program C"), HighSchool = c(1L, 
    0L, 0L, 1L, 1L, 0L, 1L, NA, 1L, 1L, NA, 1L), Employment = c(0L, 
    0L, 1L, 0L, 1L, 1L, 1L, NA, 0L, 1L, NA, 1L)), class = "data.frame", row.names = c(NA, 
    -12L))
    
    
    df %>%  
      group_by(ProgramName) %>%
      summarise(across(HighSchool:Employment, ~ if(all(is.na(.))) NA else sum(., na.rm = TRUE)))
    
    # A tibble: 3 × 3
      ProgramName HighSchool Employment
      <chr>            <int>      <int>
    1 Program A            2          1
    2 Program B            2          3
    3 Program C            3          2
    
    # This is the one you state that you actually want
    df %>%  
      group_by(ProgramName) %>%
      summarise(ED = sum(HighSchool == 1, na.rm = TRUE),
                EMP = sum(Employment == 1, na.rm = TRUE))
    
    A tibble: 3 × 3
      ProgramName    ED   EMP
      <chr>       <int> <int>
    1 Program A       2     1
    2 Program B       2     3
    3 Program C       3     2