Search code examples
apache-sparkamazon-redshiftspark-redshift

How to optimize ETL data pipeline for fault tolerance when using Spark and Redshift?


I'm writing a big batch job using PySpark that ETLs 200 tables and loads into Amazon Redshift. These 200 tables are created from one input datasource. So the batch job is successful only when data is loaded into ALL 200 tables successfully. The batch job runs everyday while appending the data into tables for each date.

For fault tolerance, reliability and idempotency my current workflow follows:

  1. Using staging tables. Create temporary Redshift tables with CREATE TEMP TABLE LIKE <target_table>
  2. Transform and Load data into staging table.
  3. Repeat 1-2 for 200 other tables.
  4. Start BEGIN transaction.
  5. Copy staging table data into target table using INSERT INTO <taget_table> SELECT * FROM <staging_table>
  6. END Transaction
  7. DROP all staging tables.

This way I can guarantee that if Step 3 fails (which is more probable), I don't have to worry about removing partial data from original tables. Rather, I'll simply re-run entire batch job since temporary tables are discarded after the JDBC disconnection.

While it solved most of the problems it's not elegant, hackish and consumes extra time. I would like to if Spark and/or Redshift provides standard tools to solve this very common problem in ETL world.

Thanks


Solution

  • COPY commands can be in a transaction block. You just NEED:

    1. BEGIN
    2. COPY data to all the tables
    3. COMMIT (if successful)

    Redshift will maintain the previous version of the tables for all other viewers and their view of the tables won't change until COMMIT.

    The upside of the process you laid out is that during the time that the transaction is running other processes cannot get an exclusive locks on the tables (ALTER TABLE etc). Your inserts will run fast than the COPYs so the transaction will be open less time. This is only an issue if other processes are modifying the table at the same time the ETL is running which is generally not a good idea.