I want to overwrite a PostgreSQL table, transactionStats, which is used by customer-facing dashboards.
This table needs to be updated every 30 minutes. I will write an AWS Glue Spark job that performs this operation over a JDBC connection.
Spark dataframe write snippet -
df.write.format("jdbc").option("url", "connection_params").option("dbtable","public.transactionStats"). mode("overwrite").save()
My question is: if I overwrite this table, will it cause downtime for customers while the table is being overwritten?
If yes, how can I avoid the downtime? Any solutions?
Yes, when overwriting with the Spark SQL data source you will affect users querying the table.
By design, the Spark JDBC data source is meant for parallel inserts (see the numPartitions parameter): the executors each open their own connection to the database, while the driver runs the DDL (such as DROP TABLE, TRUNCATE, CREATE TABLE).
As a result, multiple connections are involved during an overwrite, so it cannot be done in a single transaction (as it could be with a regular SQL script), leading to unexpected behavior for your users.
Keep in mind that you would also lose the indexes on the table after the process. You could use overwrite mode together with the truncate option to avoid dropping the table, but you would still have a downtime window; see https://issues.apache.org/jira/browse/SPARK-32595. A sketch of that write follows below.
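For reference, a minimal sketch of that truncate-based overwrite (PySpark; the JDBC URL, user and password are placeholders, and the numPartitions value is just illustrative):

    # Overwrite with truncate=true: Spark issues TRUNCATE instead of DROP/CREATE,
    # so the table definition (and its indexes) survive, but readers still see an
    # empty or partially loaded table while the reload runs.
    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://<host>:5432/<database>")  # placeholder URL
       .option("dbtable", "public.transactionStats")
       .option("user", "<user>")
       .option("password", "<password>")
       .option("truncate", "true")      # empty the table instead of dropping it
       .option("numPartitions", "8")    # executors insert in parallel over 8 connections
       .mode("overwrite")
       .save())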
An alternative design is to load your data from scratch into a staging table, create the indexes on it if needed, and then, in a single transaction issued through a JDBC statement, drop the previous table and rename the staging table to take its place, as sketched below.
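A minimal sketch of that pattern, assuming psycopg2 is available to the Glue job and using transactionStats_staging as an illustrative staging-table name (connection details are placeholders):

    import psycopg2

    # 1) Load the fresh data into the staging table. This can take as long as it
    #    needs; the live transactionStats table is untouched while it runs.
    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://<host>:5432/<database>")
       .option("dbtable", "public.transactionStats_staging")
       .option("user", "<user>")
       .option("password", "<password>")
       .mode("overwrite")
       .save())

    # 2) Recreate any indexes on the staging table here, then swap the tables in a
    #    single transaction so dashboard queries never see a missing or empty table.
    conn = psycopg2.connect(host="<host>", dbname="<database>",
                            user="<user>", password="<password>")
    try:
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("DROP TABLE IF EXISTS public.transactionStats")
                cur.execute("ALTER TABLE public.transactionStats_staging RENAME TO transactionStats")
    finally:
        conn.close()

Readers keep hitting the old data right up to the swap; the drop-and-rename is a fast catalog change that only needs a brief exclusive lock, so queries queue for an instant instead of failing.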
One last piece of advice: Spark JDBC itself is not particularly efficient with PostgreSQL, because it relies on regular JDBC transfers with a lot of per-row serialization/deserialization. For very large datasets, prefer PostgreSQL's COPY FROM / COPY TO statements based on CSV files, and CSV is a format Spark handles very well.
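As a rough, non-Glue-specific illustration of that COPY-based path (the local path, staging table name and credentials are placeholders; in a real Glue job the CSV would land on S3 and be streamed from there):

    import glob
    import psycopg2

    # Spark writes the dataframe as headerless CSV; coalesce(1) keeps the example to one file.
    # The CSV column order must match the target table's column order.
    (df.coalesce(1).write
       .mode("overwrite")
       .option("header", "false")
       .csv("/tmp/transaction_stats_csv"))

    csv_path = glob.glob("/tmp/transaction_stats_csv/part-*.csv")[0]

    conn = psycopg2.connect(host="<host>", dbname="<database>",
                            user="<user>", password="<password>")
    try:
        with conn, conn.cursor() as cur, open(csv_path) as f:
            # COPY ... FROM STDIN streams the whole file into the table in one pass,
            # avoiding the per-row overhead of JDBC inserts.
            cur.copy_expert(
                "COPY public.transactionStats_staging FROM STDIN WITH (FORMAT csv)", f)
    finally:
        conn.close()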