postgresql · r · odbc · r-dbi

Quickest way to insert several hundred batches of 500k - 3 million records into a PostgreSQL database


I have a PostgreSQL server hosted on a VM on our intranet. Elsewhere on the network (on machines that do not share a filesystem with the database VM) I have ~850 R scripts running simultaneously, each generating between 500k and 3 million rows of data that I want to push into a (currently empty) table in the database. I am trying to determine the optimal way to do so. These are the approaches I have come up with so far:

  1. Use the R odbc package and the DBI::dbWriteTable function to push the data to the remote server. As a test, I tried this route for 3 (out of ~850) batches simultaneously and it took ~25 minutes (a minimal sketch of this approach follows the list).

  2. Export the data to a text file, copy the file to the server hosting the database over SSH, and use COPY to import it.
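
For reference, this is roughly what I mean by option 1. The DSN name ("pg_dsn"), the target table ("events"), and batch_df (the data frame a script has generated) are placeholders:

    library(DBI)
    library(odbc)

    # Connect through the ODBC driver; "pg_dsn" is a placeholder DSN name
    con <- DBI::dbConnect(odbc::odbc(), dsn = "pg_dsn")

    # Append this script's rows to the existing table; append = TRUE leaves the
    # table definition (indexes, constraints) untouched
    DBI::dbWriteTable(con, "events", batch_df, append = TRUE)

    DBI::dbDisconnect(con)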

Are there better ways to do this, and which approach would be the fastest?

The table that I am trying to push to is indexed and has foreign keys as well as unique constraints. I've read that dropping these before loading the data and adding them back afterwards can speed the load up considerably. Would that make a meaningful difference here?
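
For concreteness, something along these lines is what I had in mind; the index and constraint names below are made up for illustration:

    library(DBI)
    library(odbc)

    con <- DBI::dbConnect(odbc::odbc(), dsn = "pg_dsn")

    # Before the bulk load: drop secondary structures so every insert is a plain
    # heap write with no index maintenance or constraint checking
    DBI::dbExecute(con, "DROP INDEX IF EXISTS events_created_at_idx")
    DBI::dbExecute(con, "ALTER TABLE events DROP CONSTRAINT IF EXISTS events_source_id_key")

    # ... bulk load happens here ...

    # After the load: rebuild them once over the complete data set
    DBI::dbExecute(con, "CREATE INDEX events_created_at_idx ON events (created_at)")
    DBI::dbExecute(con, "ALTER TABLE events ADD CONSTRAINT events_source_id_key UNIQUE (source_id)")

    DBI::dbDisconnect(con)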


Solution

  • In my experience with very large PostgreSQL databases, nothing beats running a COPY statement on the PostgreSQL host itself. Compressing the dump files with a fast compressor such as gzip will also speed up the transfer to that host considerably. On top of that, drop all indexes before the import and recreate them once the table has been filled. So yes, your option (2) is my favorite, combined with dropping the indexes for the duration of the import.
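
To make that concrete, here is a rough sketch of that pipeline driven from R. The file paths, host name, DSN, and table name ("events") are placeholders, and COPY ... FROM PROGRAM requires PostgreSQL 9.3+ plus the appropriate server-side privileges (superuser or the pg_execute_server_program role):

    library(DBI)
    library(odbc)
    library(data.table)

    # 1. Dump the batch to a compressed CSV; fwrite() gzips automatically when
    #    the file name ends in .gz
    data.table::fwrite(batch_df, "batch_0001.csv.gz")

    # 2. Ship the file to the database host over SSH (assumes key-based auth)
    system("scp batch_0001.csv.gz dbhost:/var/lib/postgresql/import/")

    # 3. Load it server-side with COPY, decompressing on the fly. Indexes should
    #    already have been dropped at this point and are recreated afterwards.
    con <- DBI::dbConnect(odbc::odbc(), dsn = "pg_dsn")
    DBI::dbExecute(con, "
      COPY events
      FROM PROGRAM 'gunzip -c /var/lib/postgresql/import/batch_0001.csv.gz'
      WITH (FORMAT csv, HEADER true)
    ")
    DBI::dbDisconnect(con)

Note that because COPY reads the file on the server, the data itself never travels over the ODBC connection; only the statement does, so the heavy lifting stays on the PostgreSQL host.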