Search code examples
rdplyranti-join

Master view of multiple dataframes with common columns


I have three dataframes like below:

df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','B','C','E','I'),col2=c(4,6,8,2,9))
df1 <- data.frame(col1=c('A','D','C','E','I'),col2=c(4,7,8,2,9))

The differences between any two files could be as below:

anti_join(df2, df3)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    B    6
# 2    I    9

anti_join(df3, df2)
# Joining, by = c("col1", "col2")
# [1] col1 col2
# <0 rows> (or 0-length row.names)

anti_join(df1, df2)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    D    7

anti_join(df2, df1)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    B    6

I would like to create a master dataframe with all the values in col1 and col2 specific to each dataframe. If there is no such value present, it should populate NA.

  col1 df1_col2 df2_col2 df3_col2
1    A        4        4        4 
2    B       NA        6       NA  
3    C        8        8        8
4    E        2        2        2 
5    I        9        9       NA
6    D        7       NA       NA

The essence of the above output could be established from the above anti_join commands. However, it does not provide the complete picture at once. Any thoughts on how to achieve this?

Edit: For multiple values in col2 for col1, the output is a little messier. For example, A has values 4, 3.

df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','A','B','C','E','I'),col2=c(4,3,6,8,2,9))
df1 <- data.frame(col1=c('A','A','D','C','E','I'),col2=c(4,3,7,8,2,9))

lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
lst_of_frames %>%
  imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
  reduce(full_join, by = "col1")

It gives the below output.

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        4        3        4
# 3    A        3        4        4
# 4    A        3        3        4
# 5    D        7       NA       NA
# 6    C        8        8        8
# 7    E        2        2        2
# 8    I        9        9       NA
# 9    B       NA        6       NA

The interesting part of the output is:

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        4        3        4
# 3    A        3        4        4
# 4    A        3        3        4

whereas the expected output is:

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        3        3       NA

Solution

  • Similar to @tamtam's answer, but a little programmatic if you have a dynamic list of frames.

    lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
    # lst_of_frames <- tibble::lst(df1, df2, df3)    # thanks, @user63230
    library(dplyr)
    library(purrr)  # imap, reduce
    lst_of_frames %>%
      imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
      reduce(full_join, by = "col1")
    #   col1 df1_col2 df2_col2 df3_col2
    # 1    A        4        4        4
    # 2    D        7       NA       NA
    # 3    C        8        8        8
    # 4    E        2        2        2
    # 5    I        9        9       NA
    # 6    B       NA        6       NA
    

    It's important (for automatically renaming the columns) that the list-of-frames be a named list; my assumption was the name of the frame variable list(df1=df1), but it could just as easily be list(A=df1) to produce a column named A_col2 in the end.