Search code examples
rdataframeinner-joinmultiple-columns

Merge Multiple Datasets with Identical Column Names in R While Attaching a Suffix to Each Column to Identify its Dataset


I am trying to merge3 datasets, which are part of longitudinal study. The variable names are a identical across the waves of the study. What I want to accomplish is to attach a suffix to each variable name to indicate the dataset it belongs to while merging them. I have been able to do that in a roundabout way, but I am thinking there should be an easier way to accomplish that. Below is what I have done so far, just using some fictitious data to illustrate.

## step 1: create three dfs with identical variable names
df1 = data.frame('ID' = 1:10, 'V1' = sample(x=1:5, size=10, replace=T), 'v2' = sample(x=1:5, size=10, replace=T))
df2 = data.frame('ID' = 1:10, 'V1' = sample(x=1:5, size=10, replace=T), 'v2' = sample(x=1:5, size=10, replace=T))
df3 = data.frame('ID' = 1:10, 'V1' = sample(x=1:5, size=10, replace=T), 'v2' = sample(x=1:5, size=10, replace=T))

## step 2: function to rename all columns in a dataset
colsRename <- function(data, nn){
  
  nn = nn
  df = data %>% 
    rename_with(~ paste0(colnames(data), paste0('_'), nn)[which(colnames(data) == .x)], 
                .cols = colnames(data)) %>% 
    glimpse()
  
  return(df)
  
}

## call function to rename columns
w1 = colsRename(df1, 'W1') 
w2 = colsRename(df2, 'W2') 
w3 = colsRename(df3, 'W3') 

# merge the dataframes
dfMerge = merge(w1, w2, by.x = c("ID_W1"), by.y = c("ID_W2"), all = FALSE) %>% 
  merge(., w3, by.x = c("ID_W1"), by.y = c("ID_W3"), all = FALSE) %>% 
  glimpse()

The final output for the joined/merged dataframes will look something like below

   ID_W1 V1_W1 v2_W1 V1_W2 v2_W2 V1_W3 v2_W3
1      1     3     4     1     5     5     1
2      2     4     2     5     4     5     2
3      3     4     3     2     4     1     3
4      4     5     1     1     2     5     1
5      5     5     1     3     1     5     1
6      6     4     4     3     4     3     5
7      7     2     5     3     2     3     2
8      8     1     1     2     1     2     2
9      9     5     3     2     2     1     3
10    10     5     2     5     3     4     5

Solution

  • You can put the dataframe in a named list, rename the dataframes and use reduce -

    library(dplyr)
    library(purrr)
    
    list(W1 = df1, W2 = df2, W3 = df3) %>%
      imap(function(x, y) x %>% rename_with(~paste(., y, sep = '_'), -ID)) %>%
      reduce(inner_join, by = 'ID')
    
    #   ID V1_W1 v2_W1 V1_W2 v2_W2 V1_W3 v2_W3
    #1   1     1     1     5     2     5     5
    #2   2     3     2     4     4     3     3
    #3   3     4     3     5     4     1     2
    #4   4     1     4     2     3     4     5
    #5   5     3     5     1     1     1     5
    #6   6     5     5     1     2     1     3
    #7   7     4     3     3     1     3     2
    #8   8     2     1     1     2     4     2
    #9   9     5     2     5     4     1     2
    #10 10     1     5     1     5     3     4
    

    If there are more dataframe and you don't want to name them individually you can collect all the dataframes from the global environment using mget.

    mget(ls(pattern = 'df\\d+')) %>%
      imap(function(x, y) x %>% rename_with(~paste(., y, sep = '_'), -ID)) %>%
      reduce(inner_join, by = 'ID')