Create household head based on age and member id


I have a data frame of household members with three numeric columns: 'hid' (household id), 'sub' (subject id within the household), and 'age'. I'd like to add a new logical variable called 'hh' that flags the household head, defined as follows:

  1. If there is only 1 member in the household, that member is the household head (TRUE).
  2. If there are 2 or more members in the household, the household head is the member aged between 18 and 65 (inclusive) with the smallest subject id ('sub') among those in that age range.
  3. If no member of the household is aged between 18 and 65, the household head is the member with the smallest subject id.

There must be 1 and only 1 household head per household.

My data looks something like this:

# A tibble: 10 x 3
     hid   sub   age
   <dbl> <dbl> <dbl>
 1     1     1    75
 2     1     2    55
 3     2     1    35
 4     3     1    69
 5     3     2    72
 6     4     1    69
 7     5     1    15
 8     5     2    17
 9     5     3    42
10     5     4    62

And I'd like the result to be like this:

> result
# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE  # Not 18-65 & there is another aged 18-65 within this household.
 2     1     2    55 TRUE   # Aged 18-65 and the smallest sub id within this household.
 3     2     1    35 TRUE   # Only 1 in this household.
 4     3     1    69 TRUE   # Not aged 18-65, but no other member is and smallest sub id.
 5     3     2    72 FALSE  # Not aged 18-65, and not the smallest sub id.
 6     4     1    69 TRUE   # Only 1 in this household.
 7     5     1    15 FALSE  # Not aged 18-65 and others in this household qualify.
 8     5     2    17 FALSE  # Not aged 18-65 and others in this household qualify.
 9     5     3    42 TRUE   # Aged 18-65 and the smallest sub id among those aged 18-65 within this household.
10     5     4    62 FALSE  # Aged 18-65 but not the smallest sub id among those aged 18-65 within this household.

Thank you!


d <- structure(list(hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
                    sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
                    age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62)),
               row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • You can sort the data so that the first row of each household is its head, then flag the first occurrence of each hid.

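    library(dplyr)
    
    d %>%
      # Within each household: members aged 18-65 first, then by smallest sub
      arrange(hid, !between(age, 18, 65), sub) %>%
      # The first row of each household is the head
      mutate(hh = !duplicated(hid))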
    
    #      hid   sub   age hh
    #    <dbl> <dbl> <dbl> <lgl>
    #  1     1     2    55 TRUE
    #  2     1     1    75 FALSE
    #  3     2     1    35 TRUE
    #  4     3     1    69 TRUE
    #  5     3     2    72 FALSE
    #  6     4     1    69 TRUE
    #  7     5     3    42 TRUE
    #  8     5     4    62 FALSE
    #  9     5     1    15 FALSE
    # 10     5     2    17 FALSE
    

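    !between(age, 18, 65) is FALSE for members aged 18-65 and TRUE for everyone else. Because FALSE sorts before TRUE, members in the age range come first within each household, ordered by sub. If nobody in the household is in range, the member with the smallest sub comes first. Note that the rows are returned in this sorted order, not in the original order.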
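
If the original row order matters, a grouped mutate() is one possible alternative to sorting. This is only a minimal sketch: it assumes 'sub' is unique within each household, and the helper column in_range is an illustrative name that is dropped at the end.

```r
library(dplyr)

# Example data from the question
d <- tibble(
  hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
  sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
  age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62)
)

result <- d %>%
  group_by(hid) %>%
  mutate(
    # Helper (illustrative name): is this member aged 18-65 inclusive?
    in_range = between(age, 18, 65),
    # If anyone is in range, the head is the in-range member with the
    # smallest sub; otherwise it is the member with the smallest sub overall.
    # A single-member household falls out of these two cases automatically.
    hh = if (any(in_range)) {
      in_range & sub == min(sub[in_range])
    } else {
      sub == min(sub)
    }
  ) %>%
  ungroup() %>%
  select(-in_range)

# Sanity check: exactly one head per household
stopifnot(all(tapply(result$hh, result$hid, sum) == 1))
```

Here the rows stay in their original order, and the final stopifnot() confirms the "1 and only 1 household head per household" requirement on the example data.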