Tags: r, postgresql, odbc, google-cloud-sql

Fastest way to upload data via R to PostgreSQL 12


I am using the following code to connect to a PostgreSQL 12 database:

con <- DBI::dbConnect(odbc::odbc(), driver = driver, server = server, database = database, uid = uid, pwd = pwd, port = port)

This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:

DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)

where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.

Uploading by the above method took approximately 9 hours, at a rate of about 40 write operations per second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?


Solution

  • I've always found bulk-copy, external to R, to be the best option. The insert itself can be significantly faster; the trade-off is (1) the extra step of writing the data to a file, in exchange for (2) a much shorter run-time.

    Setup for this test:

    • win10 (2004)
    • docker
    • postgres:11 container, using port 35432 on localhost, simple authentication
    • a psql binary in the host OS (where R is running); this is easy on linux, and on windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
    • I'm using data.table::fwrite to save the file because it's fast; write.table and write.csv would still be much faster than DBI::dbWriteTable here, but with your size of data you might prefer something quick
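
    (In the code below, con2 is a DBI connection to that container. A minimal sketch of how it could be created; the RPostgres driver and the "postgres" database name are assumptions here, and the odbc connection from the question would work just as well:)

    # assumption: RPostgres as the client driver; host/port/credentials from the docker setup above
    con2 <- DBI::dbConnect(RPostgres::Postgres(),
                           host = "127.0.0.1", port = 35432,
                           dbname = "postgres",     # assumption: default database
                           user = "postgres", password = "mysecretpassword")

    Now the comparison itself: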
    DBI::dbCreateTable(con2, "mt", mtcars)
    DBI::dbGetQuery(con2, "select count(*) as n from mt")
    #   n
    # 1 0
    
    z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
    nrow(z1000)
    # [1] 32000
    system.time({
      DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
    })
    #    user  system elapsed 
    #    1.56    1.09   30.90 
    
    system.time({
      data.table::fwrite(z1000, "mt.csv")
      URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
      system(
        sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
                "mt", paste(colnames(z1000), collapse = ","),
                sQuote("mt.csv"), URI)
      )
    })    
    # COPY 32000
    #    user  system elapsed 
    #    0.05    0.00    0.19 
    DBI::dbGetQuery(con2, "select count(*) as n from mt")
    #       n
    # 1 64000
    

    While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
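
    (Applied to your data frame, the pattern is the same: fwrite it, then \copy it with psql. A rough sketch, where df and tablename are from your question; the user, password, host, and database in the URI are placeholders for your Cloud SQL instance, which psql must be able to reach, e.g. from an authorized network or through the Cloud SQL proxy:)

    # sketch only: bulk-load the question's data frame; all connection details are placeholders
    data.table::fwrite(df, "df.csv")
    URI <- sprintf("postgresql://%s:%s@%s:%s/%s",
                   "your_user", "your_password", "your-cloudsql-host", "5432", "your_db")
    system(
      sprintf("psql -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
              tablename, paste(colnames(df), collapse = ","),
              sQuote("df.csv"), URI)
    )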


    Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:

    z100 <- data.table::rbindlist(replicate(100, mtcars, simplify=F))
    
    system.time({
      data.table::fwrite(z100, "mt.csv")
      URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
      system(
        sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
                "mt", paste(colnames(z100), collapse = ","),
                sQuote("mt.csv"), URI)
      )
    })
    # COPY 3200
    #    user  system elapsed 
    #     0.0     0.0     0.1 
    
    system.time({
      DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
    })
    #    user  system elapsed 
    #    0.17    0.04    2.95 
    
    system.time({
      DBI::dbAppendTable(con2, "mt", z100)
    })
    #    user  system elapsed 
    #    0.74    0.33   23.59 
    

    (I don't want to time dbAppendTable with z1000 above ...)

    (For kicks, I ran it with replicate(10000, ...) and repeated the psql and dbWriteTable tests; they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...)
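
    (Cleanup from R is a one-liner, assuming con2 is still open:)

    DBI::dbExecute(con2, "drop table mt")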