Context: I have a dataframe of individual people grouped by household, which includes relationship parameters for each individual describing their relationship to every other individual in the household.
Goal: I am trying to create sub-groups within households according to tax-units. Individuals belong to the same tax-unit if they are (1) spouses or (2) a parent and their dependent child. A dependent child is defined as a child under 18, or a child under 23 who is a student.
There may be a single tax-unit or multiple tax-units within a given household. Other married couples, or individuals in each household who are not dependent children, form separate tax-units.
Example Dataframe:
household name age student r01 r02 r03 r04 r05
1 1 john 60 0 <NA> spouse parent parent parent
2 1 mary 56 0 spouse <NA> parent parent parent
3 1 fiona 25 0 child child <NA> sibling sibling
4 1 tim 20 1 child child sibling <NA> sibling
5 1 nora 16 0 child child sibling sibling <NA>
6 2 terrence 58 0 <NA> spouse child-in-law step-child-in-law parent
7 2 siobhan 57 0 spouse <NA> child step-child parent
8 2 jim 90 0 parent-in-law parent <NA> spouse grand-parent
9 2 maire 87 0 step-parent-in-law step-parent spouse <NA> other
10 2 eoin 21 1 child child grand-child other <NA>
11 3 ronald 50 0 <NA> <NA> <NA> <NA> <NA>
Code to reproduce:
df <- data.frame(household = c(rep(1,5), rep(2,5), 3),
                 name = c("john", "mary", "fiona", "tim", "nora", "terrence", "siobhan", "jim", "maire", "eoin", "ronald"),
                 age = c(60, 56, 25, 20, 16, 58, 57, 90, 87, 21, 50),
                 student = c(0,0,0,1,0,0,0,0,0,1,0),
                 r01 = c(NA, "spouse", rep("child",3), NA, "spouse", "parent-in-law", "step-parent-in-law", "child", NA),
                 r02 = c("spouse", NA, rep("child", 3), "spouse", NA, "parent", "step-parent", "child", NA),
                 r03 = c(rep("parent",2), NA, rep("sibling", 2), "child-in-law", "child", NA, "spouse", "grand-child", NA),
                 r04 = c(rep("parent",2), "sibling", NA, "sibling", "step-child-in-law", "step-child", "spouse", NA, "other", NA),
                 r05 = c(rep("parent", 2), rep("sibling",2), NA, rep("parent", 2), "grand-parent", "other", NA, NA))
Approach: To start I created variables to list family member order, and to identify a dependent child.
library(dplyr)

df <- df %>%
  group_by(household) %>%
  mutate(fam_mem = row_number(),   # position of each person within the household
         dep_child = ifelse(age < 18 | (age < 23 & student == 1), 1, 0))
My next step was to identify the parents of dependent children using match; however, this is where I am getting stuck, as match will tell me whether someone is a parent, but I cannot link that to dependence status.
After this I hoped to sort by dependence status and use lag to create a new household variable that groups individuals into tax-units, e.g. 1a, 2a, 2b, 3a.
Desired Output
household name age student r01 r02 r03 r04 r05 fam_mem dep_child household_tax_unit
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <chr>
1 1 john 60 0 NA spouse parent parent parent 1 0 1a
2 1 mary 56 0 spouse NA parent parent parent 2 0 1a
3 1 fiona 25 0 child child NA sibling sibling 3 0 1b
4 1 tim 20 1 child child sibling NA sibling 4 1 1a
5 1 nora 16 0 child child sibling sibling NA 5 1 1a
6 2 terrence 58 0 NA spouse child-in-law step-child-in-law parent 1 0 2a
7 2 siobhan 57 0 spouse NA child step-child parent 2 0 2a
8 2 jim 90 0 parent-in-law parent NA spouse grand-parent 3 0 2b
9 2 maire 87 0 step-parent-in-law step-parent spouse NA other 4 0 2b
10 2 eoin 21 1 child child grand-child other NA 5 1 2a
11 3 ronald 50 0 NA NA NA NA NA 1 0 3a
John and Mary share a tax unit with their dependent children Tim and Nora, while Fiona is given her own tax-unit as she is not a dependent child.
Terrence and Siobhan are married and share a tax unit with their dependent child Eoin, while Jim and Maire, who are also married, share a separate tax-unit because their child/step-child is not a dependent.
Ronald lives alone and is therefore a single tax-unit.
The memb function creates an adjacency matrix mm from the r columns and dep_child. The line with the logical condition, marked ##, is the key line: it links two people if they are spouses or if one is a dependent child of the other. The following line ensures that mm is symmetric. Given mm, form an igraph graph g. Finally, return the component membership vector as the output of the memb function.
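To make the key line concrete, here is a minimal base R sketch for household 1 alone. The object names (rel, dep, adj) are illustrative and are not part of the answer's code; the relationship values and dependence flags are copied from the question's data.

```r
# Household 1 only; rows and columns are john, mary, fiona, tim, nora.
rel <- rbind(
  john  = c(NA, "spouse", "parent", "parent", "parent"),
  mary  = c("spouse", NA, "parent", "parent", "parent"),
  fiona = c("child", "child", NA, "sibling", "sibling"),
  tim   = c("child", "child", "sibling", NA, "sibling"),
  nora  = c("child", "child", "sibling", "sibling", NA)
)
colnames(rel) <- rownames(rel)
dep <- c(0, 0, 0, 1, 1)  # dep_child flags for the five rows

rel[is.na(rel)] <- ""
# dep is recycled down each column, so row i is always tested against dep[i]
adj <- rel == "spouse" | (rel == "child" & dep)
# make the links two-way: parent-to-child as well as child-to-parent
adj <- adj | t(adj)
adj
```

In the resulting matrix, john, mary, tim and nora are linked to one another through the spouse and dependent-child entries, while fiona's row and column contain no TRUE values, which is why she ends up in a tax-unit of her own.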
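library(dplyr)
library(igraph)

# Return a tax-unit index for each person in one household.
# dep_child: 0/1 vector; r: the relationship columns for that household.
memb <- function(dep_child, r) {
  # keep only as many relationship columns as there are household members
  m <- as.matrix(r)[, 1:nrow(r), drop = FALSE]
  m[is.na(m)] <- ""
  mm <- m == "spouse" | (m == "child" & dep_child) ##
  mm <- mm | t(mm)
  g <- graph_from_adjacency_matrix(mm, mode = "undirected")
  # each connected component of g is one tax-unit
  components(g)$membership
}

df %>%
  mutate(unit = row_number(),
         dep_child = +(age < 18 | (age < 23 & student == 1)),
         memb = memb(dep_child, pick(starts_with("r"))),
         memb = paste0(household, letters[memb]), .by = household)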
giving
household name age student r01 r02 r03
1 1 john 60 0 <NA> spouse parent
2 1 mary 56 0 spouse <NA> parent
3 1 fiona 25 0 child child <NA>
4 1 tim 20 1 child child sibling
5 1 nora 16 0 child child sibling
6 2 terrence 58 0 <NA> spouse child-in-law
7 2 siobhan 57 0 spouse <NA> child
8 2 jim 90 0 parent-in-law parent <NA>
9 2 maire 87 0 step-parent-in-law step-parent spouse
10 2 eoin 21 1 child child grand-child
11 3 ronald 50 0 <NA> <NA> <NA>
r04 r05 unit dep_child memb
1 parent parent 1 0 1a
2 parent parent 2 0 1a
3 sibling sibling 3 0 1b
4 <NA> sibling 4 1 1a
5 sibling <NA> 5 1 1a
6 step-child-in-law parent 1 0 2a
7 step-child parent 2 0 2a
8 spouse grand-parent 3 0 2b
9 <NA> other 4 0 2b
10 other <NA> 5 1 2a
11 <NA> <NA> 1 0 3a
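If igraph is not available, the component step could be approximated in base R. The sketch below is only an illustration under that assumption: label_components is a hypothetical helper written for this example, not an igraph function, and the adjacency matrix for household 2 is typed in by hand from the question's data rather than built from the r columns.

```r
# Hypothetical helper: label the connected components of a symmetric logical
# adjacency matrix by growing a group outward from each unlabelled person.
label_components <- function(adj) {
  n <- nrow(adj)
  membership <- integer(n)  # 0 means "not yet assigned"
  current <- 0L
  for (start in seq_len(n)) {
    if (membership[start] != 0L) next
    current <- current + 1L
    frontier <- start
    while (length(frontier) > 0) {
      membership[frontier] <- current
      # everyone linked to the frontier, then keep only the unlabelled ones
      reached <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
      frontier <- reached[membership[reached] == 0L]
    }
  }
  membership
}

# Household 2: terrence, siobhan, jim, maire, eoin (in that order)
adj <- matrix(FALSE, 5, 5)
adj[1, 2] <- TRUE  # terrence and siobhan are spouses
adj[3, 4] <- TRUE  # jim and maire are spouses
adj[5, 1] <- TRUE  # eoin is a dependent child of terrence
adj[5, 2] <- TRUE  # eoin is a dependent child of siobhan
adj <- adj | t(adj)

units <- label_components(adj)
paste0(2, letters[units])
```

For this household the labels come out as 2a, 2a, 2b, 2b, 2a, matching the desired output in the question. Component numbers follow the order in which people are first reached, so the letters depend on row order.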