I'm writing a big batch job using PySpark that ETLs 200 tables and loads into Amazon Redshift. These 200 tables are created from one input datasource. So the batch job is successful only when data is loaded into ALL 200 tables successfully. The batch job runs everyday while appending the data into tables for each date.
For fault tolerance, reliability and idempotency my current workflow follows:
CREATE TEMP TABLE LIKE <target_table>
BEGIN
transaction.INSERT INTO <taget_table> SELECT * FROM <staging_table>
END
TransactionDROP
all staging tables.This way I can guarantee that if Step 3 fails (which is more probable), I don't have to worry about removing partial data from original tables. Rather, I'll simply re-run entire batch job since temporary tables are discarded after the JDBC disconnection.
While it solved most of the problems it's not elegant, hackish and consumes extra time. I would like to if Spark and/or Redshift provides standard tools to solve this very common problem in ETL world.
Thanks
COPY commands can be in a transaction block. You just NEED:
Redshift will maintain the previous version of the tables for all other viewers and their view of the tables won't change until COMMIT.
The upside of the process you laid out is that during the time that the transaction is running other processes cannot get an exclusive locks on the tables (ALTER TABLE etc). Your inserts will run fast than the COPYs so the transaction will be open less time. This is only an issue if other processes are modifying the table at the same time the ETL is running which is generally not a good idea.