Search code examples
rdplyramazon-redshiftdbidbplyr

How do I convert a "tbl_df" "tbl" "data.frame" object to a "tbl_Redshift" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" object for uploading?


I am very new to using servers for data analysis, and I am completely baffled by what a "tbl_df""tbl""data.frame" object, and "tbl_Redshift" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" is in R? I know that something related to the Redshift server and tbl-lazy tables but that's it.

I have been following on a previous analyst's work, and I have used collect() to pull a bunch of data onto the local machine, doing some transformation before trying to upload it back to the server. However, I cannot upload the data using compute() for some reason, and I believe the reason is because the object itself became "tbl_df""tbl""data.frame" from "tbl_Redshift" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl".

Is there a way to transform this back so I can upload it? Thanks in advance.


Solution

  • Let's begin by ensuring the core concepts are clear:

    • tbl_df, tbl, and data.frame are all local objects where the data exist in local R memory.
    • tbl_Redshift, tbl_dbi, and tbl_sql are all remote objects where the data exists in the database and R has a pointer/query/connection to the database in local R memory.

    The collect command retrieves data from a remote database into local R memory. Hence it converts an object of type tbl_Redshift, tbl_dbi, or tbl_sql into an object of type tbl_df, tbl, 0r data.frame.

    The compute command works for remote objects (tbl_Redshift, tbl_dbi, or tbl_sql) and saves an temporary/intermediate stage of a calculation to the database, creating a new remote object. This helps reduce additional computation. It can not be used to copy data from local R memory.

    Put another way:

    • collect : remote data input -> local data output
    • compute : remote data input -> remote date output

    It sounds like what you need is : local data input -> remote data output.

    For this, there are two broad options:

    1. Redesign your code to avoid collect. If the data is never copied from the server into local R memory then you never have to transfer it back.
    2. Use a function designed for copying data to a server. I use the DBI::dbWriteTable command for this purpose. Example function below.
    copy_from_r_to_sql <- function(db_conn, db, schema, sql_table_name, r_table_name) {
      DBI::dbWriteTable(
        db_conn,
        DBI::Id(
          catalog = db,
          schema = schema,
          table = sql_table_name
        ),
        r_table_name
      )
    }
    

    (This is a cut down version of one of my dbplyr helpers here)