So I have the following data set (this is a small sample/example of what it looks like, with the original being 7k rows and 30 columns over 7 decades):
Year,Location,Population Total, Median Age, Household Total
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA
I want to create a summary table that indicates how many years of data is available by location for each variable. So something like this would work (for the small example from before):
Location,Population Total, Median Age, Household Total
Adak,1,2,2
Akiachak,2,0,2
Rainfall,1,0,1
Tioga,1,0,0
I'm new to R and haven't used these two commands together so I'm unsure of the syntax. Any help would be wonderful or alternatives.
A solution with summarize_all
from dplyr
:
library(dplyr)
df %>%
group_by(Location) %>%
summarize_all(funs(sum(!is.na(.)))) %>%
select(-Year)
Or you can use summarize_at
:
df %>%
group_by(Location) %>%
summarize_at(vars(-Year), funs(sum(!is.na(.))))
Result:
# A tibble: 4 x 4
Location PopulationTotal MedianAge HouseholdTotal
<chr> <int> <int> <int>
1 Adak 1 2 2
2 Akiachak 2 0 2
3 Rainfall 1 0 1
4 Tioga 1 0 0
Data:
df = read.table(text = "Year,Location,PopulationTotal, MedianAge, HouseholdTotal
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA", header = TRUE, sep = ",", stringsAsFactors = FALSE)
library(dplyr)
df = df %>%
mutate_at(vars(PopulationTotal:HouseholdTotal), as.numeric)