Search code examples
rselectdplyrfilterleft-join

How link information of one column with each other in R


I have a dataset with approximately 1.5 million individual who can be distinguished in their household with “Household ID”. The dataset has a column (relationship) that specifies the relationship between people in the household based on the relationship they have with the head of household.

    library(tidyverse)
    sample <- tibble( 
    household.ID = c(11015015988, 11015015988, 11015015988 , 
    11015015988 , 11015015988, 
    11015015988, 11015015988, 11015015988,228979641, 
    228979641, 228979641 ,228979641),
    member.ID= c(1101502683 ,11015026954,11015027098,11015027231 
    ,11015027353,11015027484 
    ,11015027615 ,11015027751,228992311,228996137,229001877,229005869),
    relationship = c(1,2,3,3,3,3,2,3,1,3,2,2),
    gender = c(1,2,1,2,1,1,2,2,1, 2, 2 , 2),
    age = c(54,54,30,26,23,31,20,2, 60,12,34,62),
    marriage.status= c(1,1,4,4,4,1,1,NA, 1, 4, 1,1),
    children.ever.born= c(NA,8,NA,NA,NA,NA,1,NA,NA,NA,1,1),
    living.children  = c(NA,8,NA,NA,NA,NA,1,NA,NA,NA,1,1))

“code 1” refers to the head of the household; “code 2” is the wife of the head; “code 3” is the child of the head; therefore, anyone with “code 2” within the household is the mother of “code 3”. I needed to link “code 2” and “code 3” so that within each household ID, relationship values of 3 identify the member ID from the same household with a relationship value of 2. With the code below I am able to do so.

    sample2 <- select(sample, 1:8) 
    spouse_links <- left_join(relationship = "many-to-many",
    sample2 %>% filter(relationship == 3),
    sample2 %>% filter(relationship == 2, !is.na(living.children)) %>%
    rename(member.ID.mother = member.ID), join_by(household.ID)) %>%
    filter(!is.na(relationship.y)) %>%
    select(1:8, member.ID.mother)

However, I have a problem with households in which polygamy is the custom (the practice of marrying multiple spouses). In such a household, we face two or more “code 2” in the relationship column; as a result, the R codes link all “code 3” to both “code 2”. To clarify, consider the following table:

enter image description here

According to the table, we can find out that children are linked to their mothers based on proximity in the household. For instance, row 2 is the mother of children in rows 3, 4, 5, and 6, and row 7 is the mother of row 8; however, this is not always the case and sometimes the age of mother and the "living children" can be helpful. I need to reach a table like the following:

enter image description here

I would really appreciate any help you can provide.


Solution

  • In the OP, it's unclear what information the member.ID sort order is providing. It's suggested that the sort order is relevant to indicate that those children reported after a mother is reported are the children of that previously-reported mother -- except in cases where that's not true.

    I think this is a problematic assumption and we should either know it's reliable (because the dataset info sheets tell us it is); or its not.

    If it is reliable, then when children are reported before any mothers are reported, it might imply that child's mother is not alive or simply not living in the household. In which case, it questions the OP assumption that all children in the household have mothers in the household and how the analysis should be performed?

    If the member.ID sort order is not reliable, then using it at all for any assignment has no merit and any assignments would be coincidental or reflect the casual order in which survey enumerators suveyed the household.

    The solution below provides an answer for the "first mother" as requested in OP assuming the sort order of member.ID is meaningful for completeness. But I favor a heuristic approach based on age and number of children ever born.

    library(data.table)
    setDT(sample)
    

    Defined the mother.ID variable and subset the mothers and children in the dataset

    sample[,mother.ID:=as.numeric(NA)]
    mothers <- sample[relationship==2]
    children <- sample[relationship==3]
    

    For those mothers who've never had children filter them out as possbilities

    mothers[is.na(children.ever.born),
            children.ever.born:=0]
    
    mothers <- mothers[children.ever.born>=0]
    

    Within each household, cross join all possible mothers and all possible children data.table's .EACHI nomenclature enables this within-group cross join in a memory-efficient (fast) manner

    motherschildren <- children[,
                                .(household.ID,
                                  child.ID=member.ID,
                                  child.age=age)][mothers[,.(household.ID,
                                                             mother.ID=member.ID,
                                                             mother.age=age,
                                                             children.ever.born)],
                                        .(mother.ID,
                                          child.ID,
                                          child.age,
                                          mother.age,
                                          children.ever.born),
                                        on=.(household.ID),
                                        by=.EACHI]
    

    For the sake of completeness for the OP, if we do assume the member.ID sort order is relevant, then assign first.mother.ID to each child. And then continue with the remaining solution to assign the most likely mother to each child where first.mother.ID yields NA. But unless the data specs explicitly state this, I don't think it's a strong assumption.

    if (FALSE) {   
          setorder(motherschildren,
                -household.ID,
                child.ID,
                -mother.ID)
           motherschildren[,
                        first.mother.ID:=mother.ID[mother.ID < child.ID][1],
                        by=.(child.ID)]
            sample[unique(motherschildren[,.(household.ID,
                                         child.ID,
                                         first.mother.ID)]),
               mother.ID:=i.first.mother.ID,
               on=.(household.ID,
                    member.ID=child.ID)]
            motherschildren <- motherschildren[is.na(first.mother.ID)==TRUE] 
    }
    

    In the absence of known information about the dataset and how the data was collected and represented in the dataset, then we must assign children to mothers based on the best guess. Age is a good factor that can certainly rule-out possibilities (eg, child is older than the mother).

    Create an "age.score" based on the age difference between mother and child, whereby children are heuristically ranked to the mother according to prime child-baring age, eg, given two women where the one is 25 years old at child's birth and another is 37, assume the 25-year old gave birth.

    motherschildren[,age.difference:=mother.age-child.age]
    
    motherschildren[,
                    age.score:=fcase(age.difference <= 0,as.numeric(NA),
                                     age.difference >= 20 & age.difference <= 30,1,
                                     age.difference >= 17,2,
                                     default=3)]
    

    It might be interesting to see if there are any datapoints in this set of improbable children?

    motherschildren[age.score==3]
    

    Filter out impossible mothers based on age

    motherschildren <- motherschildren[is.na(age.score)==FALSE]
    

    In addition to age, we also know how many children each mother has given birth to. Create a rank.score to filter out possible children beyond the number the woman has given birth to. Eg, if she's given birth to 2 children and there are 10 children then the household, we know that she cannot be the mother for more than two. The rank.score will identify the first number of children (2, in this example), based on the age.score. This is a questionable assumption though, since all the women in the household are unlikley to have children sequentially. But I think it's a better assumption than nothing at all. Possibly scoring using <= children.ever.born+1 could allow some more uncertainty across different potential mothers?

    Sort the dataset by each mother's possible children relative to the child's age (ie, if she gave birth to that child, then she will have given birth to the oldest child first and the youngest child most recently)

    setorder(motherschildren,
             household.ID,
             mother.ID,
             -child.age)
    
    motherschildren[,
                    rank.score:=FALSE]
    
    motherschildren[,
                    rank.score:= (1:.N) <= children.ever.born, #children.ever.born+1 may be a better assumption?
                    by=.(household.ID,
                         mother.ID)]
    

    Filter out possible children whose rank score exceeds the number of children the mother has born.

    motherschildren <- motherschildren[rank.score==TRUE]
    

    Sort the dataset by each child's possible mother, according to the age.score where the possible mother with the highest age.score will be ordered first

    Group by each household and each child and select the first possible mother as the most likely mother to have given birth to that child.

    setorder(motherschildren,
             household.ID,
             child.ID,
             age.score)
    
    motherschildren <- motherschildren[,
                                       .(mother.ID=mother.ID[1]),
                                       by=.(household.ID,
                                            child.ID)]
    

    Join the selected mother.ID value back into the overall dataset.

    sample[motherschildren,
           mother.ID:=i.mother.ID,
           on=.(household.ID,
                member.ID=child.ID)]
    

    It might be interesting to see if there are any datapoints in this subset?

    sample[relationship==3 & is.na(mother.ID)]
    

    For data analysis purposes, creating a mother.group variable of all identified mothers and their children as a column variable

    sample[,mother.group:=as.numeric(NA)]
    sample[relationship != 1,
           mother.group:=.GRP,
           by=.(household.ID,
                ifelse(is.na(mother.ID),
                       member.ID,
                       mother.ID))]
    
    setorder(sample,
             -household.ID,
             mother.group,
             relationship)