Search code examples
rdataframegroup-bydplyrsummary

How to use summarize and group by to create a summary table in r?


So I have the following data set (this is a small sample/example of what it looks like, with the original being 7k rows and 30 columns over 7 decades):

Year,Location,Population Total, Median Age, Household Total
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA

I want to create a summary table that indicates how many years of data is available by location for each variable. So something like this would work (for the small example from before):

Location,Population Total, Median Age, Household Total
Adak,1,2,2
Akiachak,2,0,2
Rainfall,1,0,1
Tioga,1,0,0

I'm new to R and haven't used these two commands together so I'm unsure of the syntax. Any help would be wonderful or alternatives.


Solution

  • A solution with summarize_all from dplyr:

    library(dplyr)
    df %>%
      group_by(Location) %>%
      summarize_all(funs(sum(!is.na(.)))) %>%
      select(-Year)
    

    Or you can use summarize_at:

    df %>%
      group_by(Location) %>%
      summarize_at(vars(-Year), funs(sum(!is.na(.))))
    

    Result:

    # A tibble: 4 x 4
       Location PopulationTotal MedianAge HouseholdTotal
          <chr>           <int>     <int>          <int>
    1      Adak               1         2              2
    2  Akiachak               2         0              2
    3  Rainfall               1         0              1
    4     Tioga               1         0              0
    

    Data:

    df = read.table(text = "Year,Location,PopulationTotal, MedianAge, HouseholdTotal
                    2000, Adak, 220, 45, 67
                    2000, Akiachak, 567, NA, 98
                    2000, Rainfall, 2, NA, 11
                    1990, Adak, NA, 33, 56
                    1990, Akiachak, 456, NA, 446
                    1990, Tioga, 446, NA, NA", header = TRUE, sep = ",", stringsAsFactors = FALSE)
    
    library(dplyr)    
    df = df %>%
      mutate_at(vars(PopulationTotal:HouseholdTotal), as.numeric)