Search code examples
rdataframecategorical-datatapply

R - creating a variable that records categorical info of other member of a group


I have a dataset data of families, with every household identified by a variable id and every person identified with id + num (id of family + person of the family). For every person I have various demographic characteristics, like this:

id  num  age  wage     edu            marital_status
1   1    33   1200    Secondary       Married/Cohabitating
1   2    35   1100    College         Married/Cohabitating
1   3    12   -1      Not applicable  Not applicable
2   1    27   1600    College         Single
3   1    59   2000    Secondary       Married/Cohabitating
3   2    51   1800    Other           Married/Cohabitating

       

I have created a set of variables that record the characteristics of another member of the household. So, for example, I wanted to have a variable for 'wage of partner' wage_p for households with two married or cohabitating adults, which I obtained with

sums = tapply(data$wage, data$id, sum)
data$wage_tot = sums[match(data$id,names(sums))]
data$wage_tot[!(data$id %in% data$id[duplicated(data$id)])] = NA
data$wage_p = data$wage_tot - data$wage

Basically, I summed wage per household obtaining wage_tot and then subtracted wage to obtain wage_p.

Which worked because I first restricted the dataset to married or cohabitating individuals (so I had 1 or 2 individuals per household). (I know this is probably more convoluted than necessary).

My results:

id  num  age  wage     edu            marital_status        wage_tot   wage_p
1   1    33   1200    Secondary       Married/Cohabitating   2300      1100
1   2    35   1100    College         Married/Cohabitating   2300      1200
2   1    27   1600    College         Single                 NA         NA
3   1    59   2000    Secondary       Married/Cohabitating   3800      1800
3   2    51   1800    Other           Married/Cohabitating   3800      2000  

Now the problem comes when I want to do this with categorical variables, because I cannot get a total and then subtract like I did with the continuous variables. So for example, if I want to create a variable that records the level of education of the spouse, edu_p.

id  num  age  wage     edu            marital_status         edu_p
1   1    33   1200    Secondary       Married/Cohabitating   College
1   2    35   1100    College         Married/Cohabitating   Secondary
2   1    27   1600    College         Single                 NA
3   1    59   2000    Secondary       Married/Cohabitating   Other
3   2    51   1800    Other           Married/Cohabitating   Secondary 

The only idea that I could come up with would be to transform the categorical variable in numeric, use my method, and then transform them again but i'm sure it is waay more complicated that it has to be.

Can anyone help me out?


Solution

  • Consider a merge solution to compare each couple against each other using id. Final left join merge used to include not couple observations from original data.

    spouse_merge <- subset(merge(data, data, by="id", suffixes=c("", "_p")),
                           (num < num_p | num > num_p) & 
                           marital_status != "Not applicable" &
                           marital_status_p != "Not applicable")
    
    final_df <- merge(data, spouse_merge[c(1,2, grep("_p", names(spouse_merge)))], 
                      by=c("id", "num"), all.x=TRUE)
    final_df
    

    Online Demo