I have census survey data at the household level. The data structure is like this: first household size is recorded and based on that date of birth, education level, work status, and other individual demographic information is been collected for each member.
Here is a dummy sample for household up to 4 members and only date of birth and working status (real data has household up to 10 members and 8 demos)
df <- tibble::tribble(
~id, ~House_member, ~dob_1, ~dob_2, ~dob_3, ~dob_4, ~work_1, ~work_2, ~work_3, ~work_4,
1L, 4L, 1983L, 1980L, 2009L, NA, 2L, 2L, NA, NA,
2L, 1L, 1940L, NA, NA, NA, 9L, NA, NA, NA,
3L, 2L, 1951L, 1951L, NA, NA, 9L, 9L, NA, NA,
4L, 4L, 1965L, 1973L, 2002L, NA, 2L, 2L, 8L, 2L,
5L, 3L, 1965L, 1948L, 2006L, NA, 2L, 9L, NA, NA,
6L, 1L, 1951L, NA, NA, NA, 9L, NA, NA, NA,
7L, 1L, 1955L, NA, NA, NA, 10L, NA, NA, NA,
8L, 4L, 1982L, 1978L, 2008L, NA, 2L, 2L, NA, NA,
9L, 2L, 1990L, 1997L, NA, NA, 2L, 8L, NA, NA,
10L, 2L, 1953L, 1957L, NA, NA, 2L, 2L, NA, NA
)
df
# A tibble: 10 x 10
id House_member dob_1 dob_2 dob_3 dob_4 work_1 work_2 work_3 work_4
<int> <int> <int> <int> <int> <lgl> <int> <int> <int> <lgl>
1 1 4 1983 1980 2009 NA 2 2 NA NA
2 2 1 1940 NA NA NA 9 NA NA NA
3 3 2 1951 1951 NA NA 9 9 NA NA
4 4 4 1965 1973 2002 NA 2 2 8 2
5 5 3 1965 1948 2006 NA 2 9 NA NA
6 6 1 1951 NA NA NA 9 NA NA NA
7 7 1 1955 NA NA NA 10 NA NA NA
8 8 4 1982 1978 2008 NA 2 2 NA NA
9 9 2 1990 1997 NA NA 2 8 NA NA
10 10 2 1953 1957 NA NA 2 2 NA NA
I am looking for a way to find how many households for each category (date of birth, work status) has some missing member information and summarise/report it. I am not sure what is the best way to extract this kind of insight from this data structure
For example looking at sample data, id == 1 is 4 members household but dob_4 is missing (work_4 is missing for this HH) same issue for id = 4 and dob_4 etc.
I found this old post but this is exactly what I am looking for.
This extended variant of the dplyr
+ tidyr
solution gives the id and person_id of the missing values:
df <- tibble::tribble(
~id, ~House_member, ~dob_1, ~dob_2, ~dob_3, ~dob_4, ~work_1, ~work_2, ~work_3, ~work_4,
1L, 4L, 1983L, 1980L, 2009L, NA, 2L, 2L, NA, NA,
2L, 1L, 1940L, NA, NA, NA, 9L, NA, NA, NA,
3L, 2L, 1951L, 1951L, NA, NA, 9L, 9L, NA, NA,
4L, 4L, 1965L, 1973L, 2002L, NA, 2L, 2L, 8L, 2L,
5L, 3L, 1965L, 1948L, 2006L, NA, 2L, 9L, NA, NA,
6L, 1L, 1951L, NA, NA, NA, 9L, NA, NA, NA,
7L, 1L, 1955L, NA, NA, NA, 10L, NA, NA, NA,
8L, 4L, 1982L, 1978L, 2008L, NA, 2L, 2L, NA, NA,
9L, 2L, 1990L, 1997L, NA, NA, 2L, 8L, NA, NA,
10L, 2L, 1953L, 1957L, NA, NA, 2L, 2L, NA, NA
)
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = matches("dob|work"),
names_to = c("name", "person"),
names_sep = "_") %>%
group_by(id, name) %>%
filter(person <= max(House_member)) %>%
filter(is.na(value)) %>%
select(id, name, person) %>%
arrange(id, name)
Returns:
# A tibble: 8 x 3 # Groups: id, name [6] id name person <int> <chr> <chr> 1 1 dob 4 2 1 work 3 3 1 work 4 4 4 dob 4 5 5 work 3 6 8 dob 4 7 8 work 3 8 8 work 4