I have a relatively huge on-premise table (~1.5 billion rows) which I am trying to pull into AWS S3 in parquet format using AWS Glue. I am using spark JDBC to read the table and write it to S3. The problem is that I cannot pull all the data from the source table in one go as the source DB would run out of memory and complain. To tackle that I am using the predicates option to push down filters in parallel which works fine to pull data in chunks of 200 million or so. But when I try to write this dataframe to S3 it takes almost half an hour to complete:
df = spark.read.jdbc(url=host_url,
table="TABLENAME",
predicates=predicates,
properties= {
"user" : username,
"password" : password
}
)
So what I want to do is this read from DB stage in sequence:
Read Part 1 from DB --> Read Part 2 from DB --> Read Part 3 from DB
And then write all the data to S3 in parallel
Write Part 1 || Write Part 2 || Write Part 3
Two problems I have with this:
Spark reads the data as soon as an action is applied, since you are just reading and writing to s3 so data is read when the write is triggered.
Spark is not optimized to read bulk data from rdbms as it establishes only single connection to the database. If you want to stick to the spark for reading try increasing the fetchsize property to 100000 default is 1000.
For parallel processing of data you can try to leverage python multiprocessing and execute parallel read and writes
Thread 1
Read 1 -> Write 1
Thread 2
Read 2 -> Write 2
But first try to execute is sequentially only
Read 1 -> Write 1 -> Read 2 -> Write 2
Another approach I would suggest is to get all your data to s3 at once using DMS or SCT.
DMS can dump data in parquet format in s3 and will be very fast as it is optimized for migration tasks itself.
If you don't want to use DMS, you can write a sqoop import job which can be triggered through a transient EMR cluster. Sqoop is also capable for importing data in parquet format.
Glue is best for transforming the already existing data and for migrating large data you should take help of other services.