Search code examples
rdataframemergesasmerging-data

How do I merge 2 data frames on R based on 2 columns?


I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and dropped column, and they share the same 2 variables: USUBJID and TRTAG2N, which are the variables that I want to combine the 2 data frames by.

The merge function works when I am only trying to do it based off of one column:

merged <- merge(popr,droppedcol,by="USUBJID")

When I attempt to merge by using 2 columns and view the data frame "Duration", the table is empty and there are no values, only column headers. It says "no data available in table".

I am tasked with replicating the SAS code for this in R:

data duration;
  set pop combined1 ;
  by usubjid trtag2n;
run;

On R, I have tried the following

duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")

duration <- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")

duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))

duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))

I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.


Solution

  • Per the SAS documentation, Combining SAS Data Sets, and confirmed by the SAS guru, @Tom, in comments above, the set with by simply means you are interleaving the datasets. No merge (which by the way is also a SAS method which you do not use) is taking place:

    Interleaving uses a SET statement and a BY statement to combine multiple data sets into one new data set. The number of observations in the new data set is the sum of the number of observations from the original data sets. However, the observations in the new data set are arranged by the values of the BY variable or variables and, within each BY group, by the order of the data sets in which they occur. You can interleave data sets either by using a BY variable or by using an index.

    Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):

    duration <- rbind(pop, combined1)                                # STACK DFs
    duration <- with(duration, duration[order(usubjid, trtag2n),])   # ORDER ROWS
    

    However, do note: rbind does not allow unmatched columns between the concatenated data sets. However, third-party packages allow for unmatched columns including: plyr::rbind.fill, dplyr::bind_rows, data.table::rbindlist.