Search code examples
rdplyrdata.tabletidyverse

How to calculate the number of children for each female in a dataset in R?


I'm working with a dataset in R and I'm trying to calculate the number of children for each female individual based on their relationship to the household head. The dataset includes variables such as Household ID, Individual ID, Relationship to the household head, Age, Gender, and Income.

 HouseholdID IndividualID Relationshiptothehouseholdhead  Age Gender  Income
    <dbl>  <dbl> <chr>        <dbl> <chr> <dbl>
 1      1      1 C               80 male      150
 2      1      2 D               81 female      120
 3      1      3 A               60 male      630
 4      1      4 B               59 female      500
 5      1      5 E3              35 male      380
 6      1      6 F3              30 female      220
 7      1      7 E5              33 female      170
 8      1      8 F5              30 male      160
 9      1      9 G32             20 female      290
10      1     10 G51             15 female      200
11      1     11 G52             12 female      100
12      1     12 G55              8 male       80
13      2      1 A               58 male      380
14      2      2 B               55 female      220
15      2      3 E1              35 male      170
16      2      4 F1              37 female      160
17      2      5 E2              33 male      290
18      2      6 F2              30 female      110
19      2      7 G21             17 female      210
20      2      8 G22             15 female      750
21      2      9 G23             12 female      350

The data structure provided in the table includes the following variables:

Household ID: This is a unique identifier for a family household.
Individual ID: This is a unique number assigned to each individual within the household.
Relationship to the household head: Specific symbols are used to represent the relationship of an individual to the head of the household.

  • "A" denotes the household head themselves;
  • "B" denotes the spouse of the household head;
  • "C" denotes the father of the household head;
  • "D" denotes the mother of the household head;
  • For the household head's children and their descendants, the notation "E1" is used for the first child, "E2" for the second child, and so on; "F1" is used for the spouse of the first child (E1), "F2" for the spouse of the second child (E2), and so forth.
  • For the grandchildren, "G11" denotes the first child of the first child (E1), "G12" the second child of the first child (E1), "G21" the first child of the second child (E2), and so on. "H11" is used for the spouse of the first grandchild (G11), and so on.

Age: The age of the individual.
Gender: The gender of the individual, represented by "male" or "female".
Income: The income situation of the individual.

Please generate a dataset similar to Table 2 based on the data from Table 1 with the following requirements:

  1. Only include female individuals.
  2. Calculate the number of children each female has given birth to.

It's important to note that the number of children is primarily determined by the highest number following the letter, rather than simply counting the number of observations in the data. For example, in household 1, the individual with ID equal to 4 should be considered as having 5 children, not 2.

The result should be as follows:

HouseholdID IndividualID  Age Gender  Income  Numofkids
1 2 81  female 120 1
1 4 59  female 500 5
1 6 30  female 220 2
1 7 33  female 170 3
1 9 35  female 290 0
1 10  15  female 200 0
1 11  12  female 100 0
2 2 55  female 220 2
2 4 37  female 160 0
2 6 30  female 110 3
2 7 17  female 210 0
2 8 15  female 750 0
2 9 12  female 350 0

Here is the data

data = structure(list(HouseholdID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2), IndividualID = c(1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9), Relationshiptothehouseholdhead = c("C", 
"D", "A", "B", "E3", "F3", "E5", "F5", "G32", "G51", "G52", "G55", 
"A", "B", "E1", "F1", "E2", "F2", "G21", "G22", "G23"), Age = c(80, 
81, 60, 59, 35, 30, 33, 30, 20, 15, 12, 8, 58, 55, 35, 37, 33, 
30, 17, 15, 12), Gender = c("male", "female", "male", "female", 
"male", "female", "female", "male", "female", "female", "female", 
"male", "male", "female", "male", "female", "male", "female", 
"female", "female", "female"), Income = c(150, 120, 630, 500, 
380, 220, 170, 160, 290, 200, 100, 80, 380, 220, 170, 160, 290, 
110, 210, 750, 350)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -21L))

Thanks!


Solution

  • From a coding perspective this question is clear, even if the exercise itself involves simplifications and assumptions. However, I've assumed your comment about using the numbering rather than the counts extends to the household head's children's children, and so I get a different answer from you for Household 1, Individual 7. They are E5, and mother of G51, G52, G55, which implies 5 children not 3.

    library(data.table)
    dt <- as.data.table(data)
    
    # split the household relationship column into 3 for ease
    dt[, let(hh_rel   = substr(Relationshiptothehouseholdhead,1,1),
             gen2_idx = as.integer(substr(Relationshiptothehouseholdhead,2,2)),
             gen3_idx = as.integer(substr(Relationshiptothehouseholdhead,3,3)))]
    
    # family tree lookup tables (household head is gen1)
    gen2 <- dt[,
               .(gen2_max = max(gen2_idx, na.rm=TRUE)),
               by=HouseholdID]
    
    gen3 <- dt[!is.na( gen2_idx),
               .(gen3_max = max(gen3_idx, na.rm=TRUE)),
               by=.(HouseholdID, gen2_idx)]
    

    This gives the calculated no. of children born to the head of each household (generation 2), and the number of children (generation 3) born to those of gen 2 who are in the household (or whose spouses are):

    > gen2
       HouseholdID gen2_max
             <num>    <int>
    1:           1        5
    2:           2        2
    
    > gen3
       HouseholdID gen2_idx gen3_max
             <num>    <int>    <int>
    1:           1        3        2
    2:           1        5        5
    3:           2        1       NA
    4:           2        2        3
       
    

    Construct the output from these lookups:

    # women
    out <- dt[Gender == "female",
              .(HouseholdID,
                IndividualID,
                Age,
                Gender,
                Income,
                hh_rel,
                gen2_idx)]
    
    # if mother of HH, count 1 child
    out[hh_rel == "D", Numofkids := 1L]
    
    # if HH or wife of HH, highest 2nd gen index
    out[hh_rel %in% c("A", "B"), Numofkids := gen2[.SD, on=.(HouseholdID), gen2_max]]
    
    # if daughter or daughter-in-law of HH, highest 3rd gen index among own 2nd gen index
    out[hh_rel %in% c("E", "F"), Numofkids := gen3[.SD, on=.(HouseholdID, gen2_idx), gen3_max]]
    
    # otherwise 0
    out[is.na(Numofkids), Numofkids := 0L]
    
    # drop cols
    out[, let(hh_rel = NULL, gen2_idx = NULL)]
    

    Output:

    > out
        HouseholdID IndividualID   Age Gender Income Numofkids
              <num>        <num> <num> <char>  <num>     <int>
     1:           1            2    81 female    120         1
     2:           1            4    59 female    500         5
     3:           1            6    30 female    220         2
     4:           1            7    33 female    170         5
     5:           1            9    20 female    290         0
     6:           1           10    15 female    200         0
     7:           1           11    12 female    100         0
     8:           2            2    55 female    220         2
     9:           2            4    37 female    160         0
    10:           2            6    30 female    110         3
    11:           2            7    17 female    210         0
    12:           2            8    15 female    750         0
    13:           2            9    12 female    350         0