Search code examples
rdplyrsparklyr

Is it possible to do a full join in dplyr and keep all the columns used in the join?


I have two tables that I want to do a full join using dplyr, but I don't want it to drop any of the columns. Per the documentation and my own experience it is only keeping the join column for the left hand side. This is a problem when you have a row with a record for the right hand side since the join value is gone.

For example, suppose I have the two tables a and b,

customerId | revenue               customerId | state
-----------|---------              -----------|-------
    1      | 2000                       1     |  CA
    2      | 3000                       3     |  GA
    4      | 4000                       4     |  NY

doing something like full_join(a, b, by="customerId") will produce

customerId | revenue | state
-----------|---------|-------
    1      |   2000  |  CA
    2      |   3000  | <NA>
   <NA>    |   <NA>  |  GA
    4      |   4000  |  NY

so there is no way to tell which customer that third row is from. The ideal output would be

customerId.a | customerId.b | revenue | state
-------------|--------------|---------|-------
      1      |      1       |   2000  |  CA
      2      |     <NA>     |   3000  | <NA>
    <NA>     |      3       |   <NA>  |  GA
      4      |      4       |   4000  |  NY

note that this is just a toy example. I'm actually using sparklyr so this is all being run in Spark. Thus, merge won't work here for me. Is there a way to do what I'm looking for in dplyr?

EDIT: As someone pointed out this actually is working as desired in dplyr itself locally. However, I do see this problem using sparklyr (which uses dplyr). Here is the code to see that:

library(sparklyr)
sc <- spark_connect("local[4]")
d1 <- data_frame(customerId = c("1","2","4"), revenue=c(2000,3000,4000))
d2 <- data_frame(customerId = c("1","3","4"), state=c("CA", "GA", "NY"))
d1_tbl <- copy_to(sc, d1)
d2_tbl <- copy_to(sc, d2)
full_join(d1_tbl, d2_tbl, by=c("customerId"))

Solution

  • I can't reproduce your problem. All the IDs should be (and are) included in the full join.

    library(data_frame)
    d1 <- data_frame(
      customerId = c(1, 2, 4),
      revenue = c(2000, 3000, 4000)
    )
    d2 <- data_frame(
      customerId = c(1, 3, 4),
      state = c("CA", "GA", "NY")
    )
    
    full_join(d1, d2, by = "customerId")
    ## # A tibble: 4 × 3
    ##   customerId revenue state
    ##        <dbl>   <dbl> <chr>
    ## 1          1    2000    CA
    ## 2          2    3000  <NA>
    ## 3          4    4000    NY
    ## 4          3      NA    GA
    

    Update: I can reproduce the problem using sparklyr. It's weird behavior, so you might want to file an issue. (Unclear whether the problem is with sparklyr or dplyr or DBI or Spark SQL though.)

    Using explain(), you can see the SQL that was generated.

    full_join(d1_tbl, d2_tbl, by=c("customerId")) %>% explain()
    

    You can try running a custom SQL query to get what you want, though it's a little messier.

    library(DBI)
    qry <- "SELECT 
        d1.customerID AS customerID1, 
        d2.customerID AS customerID2, 
        d1.revenue, 
        d2.state 
      FROM d1 
      FULL JOIN d2 
        ON d1.customerId = d2.customerId"
    dbGetQuery(sc, qry)  
    ##   customerID1 customerID2 revenue state
    ## 1           1           1    2000    CA
    ## 2           2        <NA>    3000  <NA>
    ## 3        <NA>           3     NaN    GA
    ## 4           4           4    4000    NY