
How to create subgroups based on group relationship criteria


Context: I have a dataframe of individual people grouped by household, which includes relationship parameters for each individual describing their relationship to every other individual in the household.

Goal: I am trying to create sub-groups within households according to tax-units. Individuals belong to the same tax-unit if they are (1) spouses or (2) a parent and their dependent child. A dependent child is a child under 18, or a child under 23 who is a student.

There may be a single tax-unit or multiple tax-units within a given household. Other married couples, or individuals in each household who are not dependent children, form separate tax-units.

Example Dataframe:

   household     name age student                r01         r02          r03               r04          r05
1          1     john  60       0               <NA>      spouse       parent            parent       parent
2          1     mary  56       0             spouse        <NA>       parent            parent       parent
3          1    fiona  25       0              child       child         <NA>           sibling      sibling
4          1      tim  20       1              child       child      sibling              <NA>      sibling
5          1     nora  16       0              child       child      sibling           sibling         <NA>
6          2 terrence  58       0               <NA>      spouse child-in-law step-child-in-law       parent
7          2  siobhan  57       0             spouse        <NA>        child        step-child       parent
8          2      jim  90       0      parent-in-law      parent         <NA>            spouse grand-parent
9          2    maire  87       0 step-parent-in-law step-parent       spouse              <NA>        other
10         2     eoin  21       1              child       child  grand-child             other         <NA>
11         3   ronald  50       0               <NA>        <NA>         <NA>              <NA>         <NA>

Code to reproduce:

df <- data.frame(household = c(rep(1,5), rep(2,5), 3),
           name = c("john", "mary", "fiona", "tim", "nora", "terrence", "siobhan", "jim", "maire", "eoin", "ronald"),
           age = c(60, 56, 25, 20, 16, 58, 57, 90, 87, 21, 50),
           student = c(0,0,0,1,0,0,0,0,0,1,0),
           r01 = c(NA, "spouse", rep("child",3), NA, "spouse", "parent-in-law", "step-parent-in-law", "child", NA),
           r02 = c("spouse", NA, rep("child", 3), "spouse", NA, "parent", "step-parent", "child", NA),
           r03 = c(rep("parent",2), NA, rep("sibling", 2), "child-in-law", "child", NA, "spouse", "grand-child", NA),
           r04 = c(rep("parent",2), "sibling", NA, "sibling", "step-child-in-law", "step-child", "spouse", NA, "other", NA),
           r05 = c(rep("parent", 2), rep("sibling",2), NA, rep("parent", 2), "grand-parent", "other", NA, NA))

Approach: To start, I created one variable giving each person's position within the household and another flagging dependent children.

library(dplyr)

df <- df %>%
  group_by(household) %>%
  mutate(fam_mem = row_number(),  # position of the person within the household
         dep_child = ifelse(age < 18 | (age < 23 & student == 1), 1, 0))

My next step was to identify the parents of dependent children using match. This is where I am stuck: match tells me whether someone is a parent, but I cannot link that to the dependence status of their child.

After this, I hoped to sort by dependence status and use lag to create a new household variable that groups people into tax-units, e.g. 1a, 2a, 2b, 3a.
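
The labelling idea above can be sketched in isolation. This is only an illustration: it assumes a within-household unit index already exists, and the `labels_df` data frame and its `unit_id` column are hypothetical names that do not appear in the original data.

```r
# Hypothetical within-household unit index (1 = first tax-unit, 2 = second, ...).
labels_df <- data.frame(
  household = c(1, 1, 1, 2, 2, 3),
  unit_id   = c(1, 1, 2, 1, 2, 1)
)

# letters[] maps 1 -> "a", 2 -> "b", ...; this only works for up to 26 units per household.
labels_df$household_tax_unit <- paste0(labels_df$household, letters[labels_df$unit_id])
labels_df$household_tax_unit
```

The hard part is producing the unit index itself, since that depends on who is linked to whom within the household.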

Desired Output

   household name       age student r01                r02         r03          r04               r05          fam_mem dep_child household_tax_unit
       <dbl> <chr>    <dbl>   <dbl> <chr>              <chr>       <chr>        <chr>             <chr>          <int>     <dbl> <chr>             
 1         1 john        60       0 NA                 spouse      parent       parent            parent             1         0 1a                
 2         1 mary        56       0 spouse             NA          parent       parent            parent             2         0 1a                
 3         1 fiona       25       0 child              child       NA           sibling           sibling            3         0 1b                
 4         1 tim         20       1 child              child       sibling      NA                sibling            4         1 1a                
 5         1 nora        16       0 child              child       sibling      sibling           NA                 5         1 1a                
 6         2 terrence    58       0 NA                 spouse      child-in-law step-child-in-law parent             1         0 2a                
 7         2 siobhan     57       0 spouse             NA          child        step-child        parent             2         0 2a                
 8         2 jim         90       0 parent-in-law      parent      NA           spouse            grand-parent       3         0 2b                
 9         2 maire       87       0 step-parent-in-law step-parent spouse       NA                other              4         0 2b                
10         2 eoin        21       1 child              child       grand-child  other             NA                 5         1 2a                
11         3 ronald      50       0 NA                 NA          NA           NA                NA                 1         0 3a  

John and Mary share a tax unit with their dependent children Tim and Nora, while Fiona is given her own tax-unit as she is not a dependent child.

Terrence and Siobhan are married and share a tax unit with their dependent child Eoin. Jim and Maire are married to each other and share a different tax-unit, because their child/step-child Siobhan is not a dependent.

Ronald lives alone and is therefore a single tax-unit.


Solution

  • The memb function builds a logical adjacency matrix mm from the r columns and dep_child: entry (i, j) is TRUE when person i is the spouse of person j, or when person i is a dependent child of person j. The line with the logical condition, marked ##, is the key line, and the line after it makes mm symmetric. From mm we form an undirected igraph graph g. Its connected components are the tax-units, so memb returns the component membership vector.

    library(dplyr)
    library(igraph)
    
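    memb <- function(dep_child, r) {
      # keep one relationship column per household member (square matrix)
      m <- as.matrix(r)[, seq_len(nrow(r)), drop = FALSE]
      m[is.na(m)] <- ""
      # dep_child recycles down the columns, so it lines up with the rows
      mm <- m == "spouse" | (m == "child" & dep_child)  ##
      mm <- mm | t(mm)
      g <- graph_from_adjacency_matrix(mm, mode = "undirected")
      # component id of each person, in row order
      components(g)$membership
    }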
    
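    # .by and pick() need dplyr 1.1.0 or later
    df %>%
      mutate(unit = row_number(),
             dep_child = +(age < 18 | (age < 23 & student == 1)),
             # starts_with("r") assumes only the relationship columns begin with "r"
             memb = memb(dep_child, pick(starts_with("r"))), 
             memb = paste0(household, letters[memb]), .by = household)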
    

    giving

       household     name age student                r01         r02          r03
    1          1     john  60       0               <NA>      spouse       parent
    2          1     mary  56       0             spouse        <NA>       parent
    3          1    fiona  25       0              child       child         <NA>
    4          1      tim  20       1              child       child      sibling
    5          1     nora  16       0              child       child      sibling
    6          2 terrence  58       0               <NA>      spouse child-in-law
    7          2  siobhan  57       0             spouse        <NA>        child
    8          2      jim  90       0      parent-in-law      parent         <NA>
    9          2    maire  87       0 step-parent-in-law step-parent       spouse
    10         2     eoin  21       1              child       child  grand-child
    11         3   ronald  50       0               <NA>        <NA>         <NA>
                     r04          r05 unit dep_child memb
    1             parent       parent    1         0   1a
    2             parent       parent    2         0   1a
    3            sibling      sibling    3         0   1b
    4               <NA>      sibling    4         1   1a
    5            sibling         <NA>    5         1   1a
    6  step-child-in-law       parent    1         0   2a
    7         step-child       parent    2         0   2a
    8             spouse grand-parent    3         0   2b
    9               <NA>        other    4         0   2b
    10             other         <NA>    5         1   2a
    11              <NA>         <NA>    1         0   3a
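
To make the matrix step concrete, here is a minimal base R sketch for household 1 only. It is not the answer's code: the relationship matrix is typed in by hand from the example data, and `label_components` is a hypothetical helper that stands in for igraph's `components()` by flood-filling the graph, so that the example runs without extra packages.

```r
# Household 1: relationship of the row person to each column person.
nm <- c("john", "mary", "fiona", "tim", "nora")
m <- matrix(c(
  "",       "spouse", "parent",  "parent",  "parent",
  "spouse", "",       "parent",  "parent",  "parent",
  "child",  "child",  "",        "sibling", "sibling",
  "child",  "child",  "sibling", "",        "sibling",
  "child",  "child",  "sibling", "sibling", ""
), nrow = 5, byrow = TRUE, dimnames = list(nm, nm))
dep_child <- c(0, 0, 0, 1, 1) == 1

# Edge if spouses, or if the row person is a dependent child of the column person.
# dep_child recycles down each column, so it lines up with the rows.
mm <- m == "spouse" | (m == "child" & dep_child)
mm <- mm | t(mm)

# Hypothetical helper: number connected components in order of first appearance.
label_components <- function(adj) {
  n <- nrow(adj)
  comp <- integer(n)  # 0 means "not visited yet"
  k <- 0L
  for (start in seq_len(n)) {
    if (comp[start] != 0L) next
    k <- k + 1L
    comp[start] <- k
    queue <- start
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      nb <- which(adj[v, ] & comp == 0L)  # unvisited neighbours of v
      comp[nb] <- k
      queue <- c(queue, nb)
    }
  }
  comp
}

membership <- label_components(mm)
setNames(paste0(1, letters[membership]), nm)
```

John, Mary, Tim and Nora are linked through the spouse and dependent-child edges and end up in one component, while Fiona has no edges and forms her own, which matches the 1a/1b labels in the desired output.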