Search code examples
rdatabasedplyrsrcdbplyr

Joining across databases with dbplyr


I am working with database tables with dbplyr

I have a local table and want to join it with a large (150m rows) table on the database

The database PRODUCTION is read only

# Set up the connection and point to the table

library(odbc); library(dbplyr)    

my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
                            t2690_username,";PWD=",t2690_password, sep="")

t2690 <- dbConnect(odbc::odbc(), .connection_string=my_conn_string)

order_line <- tbl(t2690, "order_line") #150m rows

I also have a local table, let's call it orders

# fill df with random data

orders <- data.frame(matrix(rexp(50), nrow = 100000, ncol = 5))

names(orders) <- c("customer_id", paste0(rep("variable_", 4), 1:4))

let's say I wanted to join these two tables, I get the following error:

complete_orders <- orders %>% left_join(order_line)

> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)

The issue is, if I were to set copy = TRUE, it would try to download the whole of order_line and my computer would quickly run out of memory

Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.

The only solution I have found is to upload into the writable database and use sql to join them, which is far from ideal


Solution

  • I have found the answer, you can use the in_schema() function within the tbl pointer to work across schemas within the same connection

    # Connect without specifying a database
    my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
                                t2690_username,";PWD=",t2690_password, sep="")    
    
    # Upload the local table to the TEMP db then point to it
    orders <- tbl(t2690, in_schema("TEMP", "orders"))
    
    order_line <-  tbl(t2690, in_schema("PRODUCTION", "order_line"))
    
    complete_orders <- orders %>% left_join(order_line)