I am new to Spark 2.4 and trying to figure out the best approach for bringing data from GreenPlum/PostgreSQL to Spark.
Is it better to load the data from the RDBMS with a single join query, or to load the tables individually (partitioned by a DB column) and then join them in Spark?
A native query with joins across multiple huge tables works with the code below, but I am not sure whether this is the correct approach, and I am also not sure how to specify a partition for it.
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/test")
  .option("user", "user1")
  .option("password", "password1")
  .option("query", "select a.*, b.* from a join b on a.user_id = b.user_id")
  // not sure what should go here for a query with multiple joins
  // .option("partitionColumn", "*")
  .load()
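As far as I know, Spark does not let you combine the query option with partitionColumn. A common workaround is to pass the join as a subquery through the dbtable option (with an alias) and partition on a numeric key such as user_id. A rough sketch, where the column names, bounds and partition count are only placeholders:

// Sketch: partitioned read of a join that is pushed down to the database.
// Column names, bounds and partition count are illustrative placeholders.
val joinedDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/test")
  .option("user", "user1")
  .option("password", "password1")
  .option("dbtable",
    "(select a.user_id, a.name, b.amount " +
      "from a join b on a.user_id = b.user_id) as joined")
  .option("partitionColumn", "user_id") // must be numeric, date, or timestamp
  .option("lowerBound", "1")            // placeholder: min user_id
  .option("upperBound", "1000000")      // placeholder: max user_id
  .option("numPartitions", "10")        // number of parallel JDBC connections
  .load()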
I guess the fastest option is to first load the data in parallel from the RDBMS into cluster storage, for example HDFS or S3. You would do this much like your current load statement, but define partitions for each table using the properties partitionColumn, lowerBound, upperBound and numPartitions
(see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html). This is generally very fast. The only drawback is that you need a numeric column, ideally with evenly distributed values: auto-incrementing ids or timestamps usually work well.
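A minimal sketch of that per-table import, assuming a numeric user_id column, placeholder bounds, and a Parquet staging area on HDFS:

// Sketch: load one table in parallel over JDBC and persist it to cluster storage.
// Table names, column, bounds, partition count and paths are placeholders.
def importTable(table: String, path: String): Unit = {
  spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/test")
    .option("user", "user1")
    .option("password", "password1")
    .option("dbtable", table)
    .option("partitionColumn", "user_id") // numeric column with evenly distributed values
    .option("lowerBound", "1")            // placeholder bounds
    .option("upperBound", "1000000")
    .option("numPartitions", "10")        // 10 concurrent connections to the database
    .load()
    .write.mode("overwrite").parquet(path)
}

importTable("a", "hdfs:///staging/a")
importTable("b", "hdfs:///staging/b")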
Computationally heavy operations on really huge tables are often much faster when performed with a scalable engine like Spark.
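For example, once both tables are staged, the join itself can run entirely in Spark (again just a sketch, reusing the placeholder paths from above):

// Sketch: perform the join in Spark on the staged Parquet data.
val a = spark.read.parquet("hdfs:///staging/a")
val b = spark.read.parquet("hdfs:///staging/b")
val joined = a.join(b, Seq("user_id")) // inner join on the shared key, executed by Spark
joined.write.mode("overwrite").parquet("hdfs:///staging/joined")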