amazon-web-services, amazon-redshift, pipeline

Using AWS Data Pipeline for incremental copy of on-premises Postgres to Redshift


We're exploring using Redshift for our warehouse, and we need to move new data from our existing on-premises Postgres db to Redshift nightly. It looks like you can do this with Data Pipeline and a template if your primary db is in RDS, but can you do it with Data Pipeline if your database is on-premises?


Solution

  • Since you are syncing in bulk on a periodic basis, you might want to consider pushing your data to S3, which is easy to do from on-premises, and then loading it into Redshift via the COPY command. It's fast and reliable, and you get a backup for free (cheap) in S3 as a side effect.

    You can use aws s3 cp from the awscli Python tool to push up your gzipped CSV Postgres dumps (after putting your access key in .aws/config) with something like:

    aws s3 cp current_dump.csv.gz s3://yourbucket/20170108/dump.csv.gz
    

    And then use the regular Postgres psql utility to execute the COPY on Redshift, like so:

    PGPASSWORD='YOURPASS' psql -h your.redshift.end.point.com -U youruser -d yourdb -p 5439 -c "COPY yourtable FROM 's3://yourbucket/20170108/dump.csv.gz' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' DELIMITER ',' NULL 'NULL' IGNOREBLANKLINES EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1 CSV GZIP;"
    

    You could put your Postgres dump and these commands in a shell script, with a bit of scripting for the date values, and run it as a cron job; a sketch follows below.
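    Here is a minimal sketch of such a script. The updated_at column used to pick out new rows, the local connection details, and the bucket/table names are placeholder assumptions you would adapt to your own schema:

    #!/bin/bash
    # Nightly sketch: dump recent changes from the on-premises Postgres, push the
    # gzipped CSV to S3, then COPY it into Redshift. Names and the "updated_at"
    # incremental column are placeholders; adjust for your environment.
    set -euo pipefail

    DATESTAMP=$(date +%Y%m%d)
    DUMP=current_dump.csv.gz

    # Export only rows changed in the last day from the local Postgres as gzipped CSV.
    psql -h localhost -U yourlocaluser -d yourlocaldb -c \
      "\copy (SELECT * FROM yourtable WHERE updated_at >= now() - interval '1 day') TO STDOUT WITH CSV" \
      | gzip > "$DUMP"

    # Push the dump to a date-stamped key in S3.
    aws s3 cp "$DUMP" "s3://yourbucket/$DATESTAMP/dump.csv.gz"

    # Load it into Redshift with COPY.
    PGPASSWORD='YOURPASS' psql -h your.redshift.end.point.com -U youruser -d yourdb -p 5439 -c \
      "COPY yourtable FROM 's3://yourbucket/$DATESTAMP/dump.csv.gz' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' DELIMITER ',' NULL 'NULL' IGNOREBLANKLINES EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1 CSV GZIP;"

    You could then schedule it from cron with an entry like 0 2 * * * /path/to/nightly_redshift_sync.sh (the script path is just a placeholder).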

    As a best practice, you want to load into a staging table and then merge into your main table, to support updates if needed and prevent duplicates, as in the sketch below.
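    A rough sketch of that staging approach, in the same psql style as above; the staging table yourtable_staging and the key column id are placeholder names, not something from your schema:

    # One-time setup (not part of the nightly job), creating a staging table with
    # the same columns as the target: CREATE TABLE yourtable_staging (LIKE yourtable);
    export PGPASSWORD='YOURPASS'
    rsql () { psql -h your.redshift.end.point.com -U youruser -d yourdb -p 5439 -c "$1"; }

    # 1. Start each run from an empty staging table.
    rsql "TRUNCATE yourtable_staging;"

    # 2. COPY the new dump into the staging table instead of the main table.
    rsql "COPY yourtable_staging FROM 's3://yourbucket/20170108/dump.csv.gz' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' DELIMITER ',' NULL 'NULL' IGNOREBLANKLINES EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1 CSV GZIP;"

    # 3. In one transaction, delete rows that are being replaced, then insert the new versions.
    rsql "BEGIN; DELETE FROM yourtable USING yourtable_staging WHERE yourtable.id = yourtable_staging.id; INSERT INTO yourtable SELECT * FROM yourtable_staging; COMMIT;"

    Since Redshift doesn't have an upsert statement, delete-then-insert inside a single transaction is the usual way to apply updates without creating duplicates.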