Search code examples
rdplyrtidyversedata-cleaning

Reshape data to find number of missing record/value in each category based on reference variable in R


I have census survey data at the household level. The data structure is like this: first household size is recorded and based on that date of birth, education level, work status, and other individual demographic information is been collected for each member.

Here is a dummy sample for household up to 4 members and only date of birth and working status (real data has household up to 10 members and 8 demos)

df <- tibble::tribble(
  ~id, ~House_member, ~dob_1, ~dob_2, ~dob_3, ~dob_4, ~work_1, ~work_2, ~work_3, ~work_4,
   1L,            4L,  1983L,  1980L,  2009L,     NA,      2L,        2L,      NA,      NA,
   2L,            1L,  1940L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
   3L,            2L,  1951L,  1951L,     NA,     NA,      9L,        9L,      NA,      NA,
   4L,            4L,  1965L,  1973L,  2002L,     NA,      2L,        2L,      8L,      2L,
   5L,            3L,  1965L,  1948L,  2006L,     NA,      2L,        9L,      NA,      NA,
   6L,            1L,  1951L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
   7L,            1L,  1955L,     NA,     NA,     NA,     10L,        NA,      NA,      NA,
   8L,            4L,  1982L,  1978L,  2008L,     NA,      2L,        2L,      NA,      NA,
   9L,            2L,  1990L,  1997L,     NA,     NA,      2L,        8L,      NA,      NA,
  10L,            2L,  1953L,  1957L,     NA,     NA,      2L,        2L,      NA,      NA
  )


df 
# A tibble: 10 x 10
      id House_member dob_1 dob_2 dob_3 dob_4 work_1 work_2 work_3 work_4
   <int>        <int> <int> <int> <int> <lgl>  <int>    <int>  <int> <lgl> 
 1     1            4  1983  1980  2009 NA         2        2     NA NA    
 2     2            1  1940    NA    NA NA         9       NA     NA NA    
 3     3            2  1951  1951    NA NA         9        9     NA NA    
 4     4            4  1965  1973  2002 NA         2        2      8 2    
 5     5            3  1965  1948  2006 NA         2        9     NA NA    
 6     6            1  1951    NA    NA NA         9       NA     NA NA    
 7     7            1  1955    NA    NA NA        10       NA     NA NA    
 8     8            4  1982  1978  2008 NA         2        2     NA NA    
 9     9            2  1990  1997    NA NA         2        8     NA NA    
10    10            2  1953  1957    NA NA         2        2     NA NA 

I am looking for a way to find how many households for each category (date of birth, work status) has some missing member information and summarise/report it. I am not sure what is the best way to extract this kind of insight from this data structure

For example looking at sample data, id == 1 is 4 members household but dob_4 is missing (work_4 is missing for this HH) same issue for id = 4 and dob_4 etc.

I found this old post but this is exactly what I am looking for.


Solution

  • This extended variant of the dplyr + tidyr solution gives the id and person_id of the missing values:

    df <- tibble::tribble(
      ~id, ~House_member, ~dob_1, ~dob_2, ~dob_3, ~dob_4, ~work_1, ~work_2, ~work_3, ~work_4,
      1L,            4L,  1983L,  1980L,  2009L,     NA,      2L,        2L,      NA,      NA,
      2L,            1L,  1940L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
      3L,            2L,  1951L,  1951L,     NA,     NA,      9L,        9L,      NA,      NA,
      4L,            4L,  1965L,  1973L,  2002L,     NA,      2L,        2L,      8L,      2L,
      5L,            3L,  1965L,  1948L,  2006L,     NA,      2L,        9L,      NA,      NA,
      6L,            1L,  1951L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
      7L,            1L,  1955L,     NA,     NA,     NA,     10L,        NA,      NA,      NA,
      8L,            4L,  1982L,  1978L,  2008L,     NA,      2L,        2L,      NA,      NA,
      9L,            2L,  1990L,  1997L,     NA,     NA,      2L,        8L,      NA,      NA,
      10L,            2L,  1953L,  1957L,     NA,     NA,      2L,        2L,      NA,      NA
    )
    
    
    
    library(dplyr)
    library(tidyr)
    
    df %>% 
      pivot_longer(cols = matches("dob|work"),
                   names_to = c("name", "person"),
                   names_sep = "_") %>% 
      group_by(id, name) %>% 
      filter(person <= max(House_member)) %>% 
      filter(is.na(value)) %>% 
      select(id, name, person) %>% 
      arrange(id, name)
    

    Returns:

    # A tibble: 8 x 3
    # Groups:   id, name [6]
      id name  person
      <int> <chr> <chr> 
    1     1 dob   4     
    2     1 work  3     
    3     1 work  4     
    4     4 dob   4     
    5     5 work  3     
    6     8 dob   4     
    7     8 work  3     
    8     8 work  4