I have a dataset data
of families, with every household identified by a variable id
and every person identified with id + num
(id of family + person of the family). For every person I have various demographic characteristics, like this:
id num age wage edu marital_status
1 1 33 1200 Secondary Married/Cohabitating
1 2 35 1100 College Married/Cohabitating
1 3 12 -1 Not applicable Not applicable
2 1 27 1600 College Single
3 1 59 2000 Secondary Married/Cohabitating
3 2 51 1800 Other Married/Cohabitating
I have created a set of variables that record the characteristics of another member of the household.
So, for example, I wanted to have a variable for 'wage of partner' wage_p
for households with two married or cohabitating adults, which I obtained with
sums = tapply(data$wage, data$id, sum)
data$wage_tot = sums[match(data$id,names(sums))]
data$wage_tot[!(data$id %in% data$id[duplicated(data$id)])] = NA
data$wage_p = data$wage_tot - data$wage
Basically, I summed wage
per household obtaining wage_tot
and then subtracted wage
to obtain wage_p.
Which worked because I first restricted the dataset to married or cohabitating individuals (so I had 1 or 2 individuals per household). (I know this is probably more convoluted than necessary).
My results:
id num age wage edu marital_status wage_tot wage_p
1 1 33 1200 Secondary Married/Cohabitating 2300 1100
1 2 35 1100 College Married/Cohabitating 2300 1200
2 1 27 1600 College Single NA NA
3 1 59 2000 Secondary Married/Cohabitating 3800 1800
3 2 51 1800 Other Married/Cohabitating 3800 2000
Now the problem comes when I want to do this with categorical variables, because I cannot get a total and then subtract like I did with the continuous variables.
So for example, if I want to create a variable that records the level of education of the spouse, edu_p
.
id num age wage edu marital_status edu_p
1 1 33 1200 Secondary Married/Cohabitating College
1 2 35 1100 College Married/Cohabitating Secondary
2 1 27 1600 College Single NA
3 1 59 2000 Secondary Married/Cohabitating Other
3 2 51 1800 Other Married/Cohabitating Secondary
The only idea that I could come up with would be to transform the categorical variable in numeric, use my method, and then transform them again but i'm sure it is waay more complicated that it has to be.
Can anyone help me out?
Consider a merge
solution to compare each couple against each other using id
. Final left join merge
used to include not couple observations from original data.
spouse_merge <- subset(merge(data, data, by="id", suffixes=c("", "_p")),
(num < num_p | num > num_p) &
marital_status != "Not applicable" &
marital_status_p != "Not applicable")
final_df <- merge(data, spouse_merge[c(1,2, grep("_p", names(spouse_merge)))],
by=c("id", "num"), all.x=TRUE)
final_df