Tags: csv, remote-access, talend, greenplum

CSV data load from a remote Unix server to Postgres (Greenplum) on another remote server


I have a CSV file that sits on a remote Unix server. I need to load that data into a Postgres database (Greenplum) hosted on another remote server.

Currently, I pull the CSV onto my local drive with WinSCP and then load the local copy into the remote Greenplum database using pgAdmin.

This seems like a circuitous route: pulling the data onto a local machine only to push it back out to Greenplum. It is also taking a very long time (over 100 hours).

I think there must be a way to bulk-load the remote CSV into the remote Greenplum database without any local intermediate step. Does anyone have experience with this kind of data migration? I am using Talend for the ETL.

Thanks!


Solution

  • Yes, there is a bulk-load way to move that data directly from the remote server to Greenplum, and it is significantly faster.

    Your Talend server will need network access to the segment hosts in your cluster. Here is a guide on how the network should be configured: http://gpdb.docs.pivotal.io/4380/admin_guide/intro/about_loading.html

    You can then use gpload to load the data. This utility automates starting a gpfdist process, creating an external table, and running the INSERT statement for you. Documentation on gpload: http://gpdb.docs.pivotal.io/4380/utility_guide/admin_utilities/gpload.html#topic1

    Lastly, Talend is a Pivotal partner, and there is plenty of documentation on using Talend's tools to load data into Greenplum. Talend also leverages gpfdist to load data into the database in parallel, just like gpload does.
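For reference, this is roughly what gpload automates under the hood: start gpfdist on the Unix server that holds the CSV, then point a readable external table at it from Greenplum so the segments pull the file in parallel. The hostname, port, path, table, and column names below are hypothetical; this is a sketch against a live cluster, not a runnable snippet.

```sql
-- On the remote Unix server that holds the CSV (hypothetical port/path):
--   gpfdist -d /data/exports -p 8081 &

-- In Greenplum, create a readable external table over the gpfdist location
-- (server name, file name, and columns are made up for illustration):
CREATE EXTERNAL TABLE ext_sales (
    sale_id   integer,
    sale_date date,
    amount    numeric(12,2)
)
LOCATION ('gpfdist://etl-host.example.com:8081/sales.csv')
FORMAT 'CSV' (HEADER);

-- The segments then pull the file in parallel during the load:
INSERT INTO sales SELECT * FROM ext_sales;
```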
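gpload itself is driven by a YAML control file. A minimal sketch follows, assuming a hypothetical database, host, table, and file path; see the gpload documentation linked above for the full set of options:

```shell
# Write a minimal gpload control file (all names, hosts, and paths here
# are hypothetical placeholders for illustration):
cat > sales_load.yml <<'EOF'
VERSION: 1.0.0.1
DATABASE: analytics
USER: gpadmin
HOST: mdw.example.com
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /data/exports/sales.csv
    - FORMAT: csv
    - HEADER: true
   OUTPUT:
    - TABLE: public.sales
    - MODE: insert
EOF

# Run the load from the server that holds the CSV (requires the
# Greenplum load tools to be installed there):
# gpload -f sales_load.yml
```

Running gpload from the server where the CSV lives is what removes the local hop: the file streams straight from that host into the segments.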