Search code examples
rbigdataouter-joinsqldfffbase

How can I perform full outer joins of large data sets in R?


I am trying to do data analysis in R on a group of medium sized datasets. One of the analyses I need to do requires me to do a full outer join amongst around 24-48 files, each of with has about 60 columns and up to 450,000 lines. So I've been running into memory issues a lot.

I thought at ffbase or sqldf would help, but apparently full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?


Solution

  • Here is a simple example that illustrates how to do outer joins of several datasets:

    library(sqldf)
    dat1 <- data.frame(x = 1:5,y = letters[1:5])
    dat2 <- data.frame(w = 3:8,z = letters[3:8])
    > 
    > sqldf("select * from dat1 left outer join dat2 on dat1.x = dat2.w UNION 
    +       select * from dat2 left outer join dat1 on dat1.x = dat2.w")
      x y  w    z
    1 1 a NA <NA>
    2 2 b NA <NA>
    3 3 c  3    c
    4 4 d  4    d
    5 5 e  5    e
    6 6 f NA <NA>
    7 7 g NA <NA>
    8 8 h NA <NA>
    

    There it is, a full outer join using sqldf and SQLite as a backend.

    As I also mentioned, sqldf support more back ends than SQLite. A Google search reveals that full outer joins are accomplished the exact same way in MySQL. I am less familiar with postgres but this question suggests that full outer joins are possible there as well.