Tags: r, database, dbi, dbplyr

How to use DBI::dbConnect() to read and write tables from multiple databases


I have a Netezza SQL server I connect to using DBI::dbConnect. The server has multiple databases we will name db1 and db2.

I would like to use dbplyr as much as possible and skip having to write SQL code in RODBC::sqlQuery(), but I am not sure how to do the following:

1) How to read a table in db1, work on it, and have the server write the result into a table in db2 without going through my desktop?
2) How to do a left join between a table in db1 and another in db2?

It looks like there might be a way to connect to database = "SYSTEM" instead of database = "db1" or "db2", but I am not sure what the next step would be.

con <- dbConnect(odbc::odbc(),
                 driver = "NetezzaSQL",
                 database = "SYSTEM",
                 uid = Sys.getenv("netezza_username"),
                 pwd = Sys.getenv("netezza_password"),
                 server = "NETEZZA_SERVER",
                 port = 5480)

Solution

  • I work around this problem on SQL Server using in_schema and dbExecute as follows; assuming Netezza is not too different, the same approach should apply.

    Part 1: shared connection

    The first problem is to connect to both tables via the same connection. If we use a different connection for each table, then joining them forces the data to be copied from one connection to the other, which is very slow.

    con <- dbConnect(...) # as required by your database
    
    table_1 <- dplyr::tbl(con, from = dbplyr::in_schema("db1", "table_name_1"))
    table_2 <- dplyr::tbl(con, from = dbplyr::in_schema("db2.schema2", "table_name_2"))
    

    While in_schema is intended for passing schema names, you can also use it to pass the database name (or both, separated by a dot).
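
    If you are on a recent version of dbplyr, there is also in_catalog(), which keeps the database (catalog), schema and table names as separate arguments instead of pasting them together with a dot. A minimal sketch, assuming your dbplyr version provides it and reusing the placeholder names from above:

    # equivalent to in_schema("db2.schema2", "table_name_2"), but with the
    # database (catalog), schema and table passed as separate arguments
    table_2 <- dplyr::tbl(
      con,
      from = dbplyr::in_catalog("db2", "schema2", "table_name_2")
    )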

    The following should now work without issue:

    # check connection
    head(table_1)
    head(table_2)
    
    # test join code
    left_join(table_1, table_2, by = "id") %>% show_query()
    # check left join
    left_join(table_1, table_2, by = "id") %>% head()
    

    Part 2: write to database

    A remote table is defined by two things:

    1. The connection
    2. The code of the current query (e.g. the result of show_query)

    We can use these with dbExecute to write to the database. My example is for SQL Server (which uses INTO as the keyword); you'll have to adapt it to your own environment if the SQL syntax is different.

    # optional, extract connection from table-to-save
    con <- table_to_save$src$con
    
    # SQL query
    sql_query <- paste0("SELECT *\n",
                        "INTO db1.new_table \n", # the database and name you want to save
                        "FROM (\n",
                        dbplyr::sql_render(table_to_save),
                        "\n) subquery_alias")
    # run query
    dbExecute(con, as.character(sql_query))
    

    The idea is to build a query that the database can execute itself to write the new table. I have done this by treating the existing query as a subquery of the SELECT ... INTO ... FROM (...) subquery_alias pattern.
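
    Since the question is about Netezza, which (like most databases other than SQL Server) uses CREATE TABLE ... AS rather than SELECT ... INTO, a hedged sketch of the equivalent query could look like the following; db1..new_table is a placeholder, and the cross-database qualification may need adjusting for your server:

    # same pattern with CTAS syntax; "db1..new_table" is a placeholder for the
    # database and name you want to save, adjust the qualification as needed
    sql_query <- paste0("CREATE TABLE db1..new_table AS\n",
                        "SELECT * FROM (\n",
                        dbplyr::sql_render(table_to_save),
                        "\n) subquery_alias")
    dbExecute(con, as.character(sql_query))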

    Notes:

    • If the SQL query produced by show_query or sql_render would work when you access the database directly, then the above should work (all that changes is that the command arrives via R instead of via the SQL console).
    • The functions I have written to smooth this process for me can be found here. They also include appending, deleting, compressing, indexing, and handling views.
    • Writing a table via dbExecute will error if the table already exists in the database, so I recommend checking for this first (see the sketch after these notes).
    • I use this workaround in other places, but inserting the database name with in_schema has not worked for creating views. To create (or delete) a view I have to ensure the connection is to the database where I want the view.
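
    A minimal sketch of that existence check, assuming your odbc/Netezza driver honours DBI::Id() with catalog and schema components (the names below are placeholders):

    # check before writing; dbExistsTable() is standard DBI, but support for
    # catalog-qualified Id() components depends on the driver
    new_table_id <- DBI::Id(catalog = "db1", schema = "schema1", table = "new_table")
    
    if (DBI::dbExistsTable(con, new_table_id)) {
      stop("db1.schema1.new_table already exists; drop it or pick another name")
    }
    
    dbExecute(con, as.character(sql_query))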