Search code examples
rdataframeloopsmergeunique

Merge two dataframes based upon unique values in one column using a for loop


I have two dataframes that both contain a column entitled "new_id" There are 100's, sometimes even 1000's of rows for each unique value of "new_id" in both data frames

I want to merge the two data frames based on this column but if I take a brute force approach using:

df3 <- merge(df1, df2, by = "new_id")

I run into memory issues.

I created a subset of df2 consisting of just one new_id value and merged it with df1 with success. So, I think I've come up with a solution using a for loop. However, I am not sure how to construct it since I am at a beginner level with R.

What I think the loop should do is first subset df2 for each unique value of new_id and then merge that subsetted data frame with df1, storing the result in a list. With each iteration of the loop, the list would append each new dataframes for each new_id. At the end, I would rbind the list into its own "master" dataframe.

I don't know if this is correct but I've come up with this loop as the basic structure and potential a starting point:

datalist = list()

for(i in unique(Procs_OP$new_id)){
    df <- subset(Procs_OP, new_id %in% Procs_OP$new_id))
    df2 <- merge(step2, df, by="new_id", all.step2=TRUE)    
    datalist[[i]] <- df2
}

df3 = do.call(rbind, datalist)

Solution

  • The subset should be on i as the i here is unique element looped

    un1 <- unique(Procs_OP$new_id)
    datalist <- vector('list', length(un1))
    names(datalist) <- 
    for(i in un1){
        df <- subset(Procs_OP, new_id %in% i))
        df2 <- merge(step2, df, by="new_id", all.step2=TRUE)    
        datalist[[i]] <- df2
    }