I merged two datasets from the same survey. One is on the individual level selectedindividual
and one on the household level, selectedhousehold
. I have merged the two datasets using the following code (using left_join
from dplyr
):
mergeddf <- left_join(selectedhousehold, selectedindividual) %>% group_by(shserial) %>% slice (1)
The shserial
variable is the primary key present in both datasets. Each household has a specific number. Each household can contain up to two individuals, which will therefore have the same shserial
. Because I want to conduct my analysis on the household level, I did not want to just use merge()
on the datasets (as this duplicated the observations in households with 2 individuals, for, for example, one of my dependent variables on the household level - GrossIncome
).
I still have a problem though:
I have a variable on the individual level, WrkStat
(with three levels, working
, NWork
, FTEduc
) that I want to include in my model. The code that I used for merging the datasets only retained the first observations for two shserials with the same number (I assume this, at least, I could not figure out how slice()
works exactly). This is not great for my analysis as I don't want to select one of two individuals in a household at random. To illustrate, the summary statistics of WrkStat
in both the non-merged and the merged dataframes:
> summary(selectedindividual$WrkStat)
working FTEduc NWork NA's
324 748 2455 201
> summary(mergeddf$WrkStat)
working FTEduc NWork NA's
251 77 2097 5
As a solution, I figured I would create a new variable, WrkStat2
which combines the observations for two individuals in one household. I want to create this variable before merging the datasets.
I was hoping I could create this new variable on the basis of the mutual shserial
number.
However, I can't figure out how to do this.
EDIT:
The structure of my dataframe:
selectedindividual <- structure(list(`shserial` = c(1010574, 1010574,
1011104, 1011104, 1011109, 1011109, 1011134, 1011134, 1011142,
1011143, 1011148, 1011148, 1011154, 1011154, 1011156, 1011171,
1011171, 1011174, 1011174, 1011182), `WrkStat` = structure(c(3L,
2L, 3L, 2L, 3L, NA, 1L, NA, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 1L, 2L,
3L, NA, 3L), .Label = c("working", "FTEduc", "NWork"), class = "factor")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Gives this output:
shserial WrkStat
<dbl> <fct>
1 1010574 NWork
2 1010574 FTEduc
3 1011104 NWork
4 1011104 FTEduc
5 1011109 NWork
6 1011109 NA
7 1011134 working
8 1011134 NA
9 1011142 NWork
10 1011143 NWork
11 1011148 NWork
12 1011148 FTEduc
13 1011154 NWork
14 1011154 FTEduc
15 1011156 NWork
16 1011171 working
17 1011171 FTEduc
18 1011174 NWork
I would like this output:
shserial WrkStat2
<dbl> <fct>
1 1010574 NWork/FTEduc
2 1011104 NWork/FTEduc
3 1011109 NWork
4 1011134 working
5 1011142 NWork
6 1011143 NWork
7 1011148 NWork/FTEduc
8 1011154 NWork/FTEduc
9 1011156 NWork
10 1011171 working/FTEduc
11 1011174 NWork
(This also removes the NA's that are not on it's own (so not a combined WrkStat), although I think it would also be fine if all the NA's would be removed in this process).
Another edit:
WrkStat2
should have the following labels:
"working/working",
"working/NWork",
"working/FTEduc",
"NWork/NWork",
"NWork/FTEduc",
"FTEduc/FTEduc",
"working",
"NWork",
"FTEduc"
I'm sorry if something is not clear or if the whole things is impossible (let me know). I am not great at thinking logically and have struggled with this for a few days.
We can remove NA
values and paste
WrkStat
for each shserial
.
This can be done in base R :
aggregate(WrkStat~shserial, selectedindividual, function(x)
paste0(na.omit(x), collapse = "/"))
# shserial WrkStat
#1 1010574 NWork/FTEduc
#2 1011104 NWork/FTEduc
#3 1011109 NWork
#4 1011134 working
#5 1011142 NWork
#6 1011143 NWork
#7 1011148 NWork/FTEduc
#8 1011154 NWork/FTEduc
#9 1011156 NWork
#10 1011171 working/FTEduc
#11 1011174 NWork
#12 1011182 NWork
dplyr
:
library(dplyr)
selectedindividual %>%
group_by(shserial) %>%
summarise(WrkStat2 = paste0(na.omit(WrkStat), collapse = "/"))
Or in data.table
:
library(data.table)
setDT(selectedindividual)[, (WrkStat = paste0(na.omit(WrkStat), collapse = "/")),
shserial]