Search code examples
rmergecategorical-data

Combining observations into a new variable based on a mutual variable


I merged two datasets from the same survey. One is on the individual level selectedindividual and one on the household level, selectedhousehold. I have merged the two datasets using the following code (using left_join from dplyr):

mergeddf <- left_join(selectedhousehold, selectedindividual)  %>% group_by(shserial)  %>% slice (1)

The shserial variable is the primary key present in both datasets. Each household has a specific number. Each household can contain up to two individuals, which will therefore have the same shserial. Because I want to conduct my analysis on the household level, I did not want to just use merge() on the datasets (as this duplicated the observations in households with 2 individuals, for, for example, one of my dependent variables on the household level - GrossIncome).

I still have a problem though:

I have a variable on the individual level, WrkStat (with three levels, working, NWork, FTEduc) that I want to include in my model. The code that I used for merging the datasets only retained the first observations for two shserials with the same number (I assume this, at least, I could not figure out how slice() works exactly). This is not great for my analysis as I don't want to select one of two individuals in a household at random. To illustrate, the summary statistics of WrkStat in both the non-merged and the merged dataframes:

> summary(selectedindividual$WrkStat)
working  FTEduc   NWork    NA's 
    324     748    2455     201 

> summary(mergeddf$WrkStat)
working  FTEduc   NWork    NA's 
    251      77    2097       5 

As a solution, I figured I would create a new variable, WrkStat2 which combines the observations for two individuals in one household. I want to create this variable before merging the datasets.

I was hoping I could create this new variable on the basis of the mutual shserial number. However, I can't figure out how to do this.

EDIT:

The structure of my dataframe:

selectedindividual <- structure(list(`shserial` = c(1010574, 1010574, 
1011104, 1011104, 1011109, 1011109, 1011134, 1011134, 1011142, 
1011143, 1011148, 1011148, 1011154, 1011154, 1011156, 1011171, 
1011171, 1011174, 1011174, 1011182), `WrkStat` = structure(c(3L, 
2L, 3L, 2L, 3L, NA, 1L, NA, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 1L, 2L, 
3L, NA, 3L), .Label = c("working", "FTEduc", "NWork"), class = "factor")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Gives this output:

   shserial WrkStat
      <dbl> <fct>  
 1  1010574 NWork  
 2  1010574 FTEduc 
 3  1011104 NWork  
 4  1011104 FTEduc 
 5  1011109 NWork  
 6  1011109 NA     
 7  1011134 working
 8  1011134 NA     
 9  1011142 NWork  
10  1011143 NWork  
11  1011148 NWork  
12  1011148 FTEduc 
13  1011154 NWork  
14  1011154 FTEduc 
15  1011156 NWork  
16  1011171 working
17  1011171 FTEduc 
18  1011174 NWork 

I would like this output:

   shserial WrkStat2
      <dbl> <fct>  
 1  1010574 NWork/FTEduc  
 2  1011104 NWork/FTEduc 
 3  1011109 NWork     
 4  1011134 working
 5  1011142 NWork  
 6  1011143 NWork  
 7  1011148 NWork/FTEduc
 8  1011154 NWork/FTEduc 
 9  1011156 NWork  
10  1011171 working/FTEduc
11  1011174 NWork 

(This also removes the NA's that are not on it's own (so not a combined WrkStat), although I think it would also be fine if all the NA's would be removed in this process).

Another edit:

WrkStat2 should have the following labels:

"working/working",
"working/NWork",
"working/FTEduc",
"NWork/NWork",
"NWork/FTEduc",
"FTEduc/FTEduc",
"working",
"NWork",
"FTEduc"

I'm sorry if something is not clear or if the whole things is impossible (let me know). I am not great at thinking logically and have struggled with this for a few days.


Solution

  • We can remove NA values and paste WrkStat for each shserial.

    This can be done in base R :

    aggregate(WrkStat~shserial, selectedindividual, function(x) 
               paste0(na.omit(x), collapse = "/"))
    
    #   shserial        WrkStat
    #1   1010574   NWork/FTEduc
    #2   1011104   NWork/FTEduc
    #3   1011109          NWork
    #4   1011134        working
    #5   1011142          NWork
    #6   1011143          NWork
    #7   1011148   NWork/FTEduc
    #8   1011154   NWork/FTEduc
    #9   1011156          NWork
    #10  1011171 working/FTEduc
    #11  1011174          NWork
    #12  1011182          NWork
    

    dplyr :

    library(dplyr)
    selectedindividual %>% 
       group_by(shserial) %>% 
       summarise(WrkStat2 = paste0(na.omit(WrkStat), collapse = "/"))
    

    Or in data.table :

    library(data.table)
    setDT(selectedindividual)[, (WrkStat = paste0(na.omit(WrkStat), collapse = "/")), 
                                shserial]