apache-spark, amazon-s3, aws-glue

Write data to parquet format in parallel


I have a fairly large on-premises table (~1.5 billion rows) which I am trying to pull into AWS S3 in parquet format using AWS Glue. I am using Spark JDBC to read the table and write it to S3. The problem is that I cannot pull all the data from the source table in one go, as the source DB would run out of memory and complain. To tackle that I am using the predicates option to push down filters in parallel, which works fine for pulling the data in chunks of around 200 million rows. But when I try to write this dataframe to S3 it takes almost half an hour to complete:

df = spark.read.jdbc(url=host_url,
                     table="TABLENAME",
                     predicates=predicates,  # one pushed-down filter per partition
                     properties={"user": username,
                                 "password": password})
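
For reference, the predicates are built as non-overlapping ranges on a numeric key, and the slow step is a plain parquet write; the column name, ranges and bucket path below are just illustrative placeholders:

# Illustrative only: each predicate bounds one chunk of the source table.
predicates = [
    "ID >= 0 AND ID < 200000000",
    "ID >= 200000000 AND ID < 400000000",
    # ... more ranges
]

# The slow part: writing the resulting dataframe to S3 as parquet.
df.write.mode("overwrite").parquet("s3://my-bucket/tablename/")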

So what I want to do is run the read-from-DB stage sequentially:

Read Part 1 from DB --> Read Part 2 from DB --> Read Part 3 from DB

And then write all the data to S3 in parallel:

Write Part 1 || Write Part 2 || Write Part 3

Two problems I have with this:

  1. I have no idea when Spark actually fires those queries against the DB. I know it is not when I define the dataframe as shown above, so I am unable to figure out how to make stage 1 run sequentially.
  2. I have looked around and could not find an option to write multiple dataframes to parquet partitions in parallel. Should I just use Python to parallelize the dataframe-to-parquet write statements? Is that even advisable?

Solution

  • Spark reads the data only when an action is applied; since you are just reading and then writing to S3, the data is read when the write is triggered.
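
    As a minimal illustration of that laziness (the S3 path below is just a placeholder):

    # Defining the dataframe does not contact the database yet;
    # Spark only records the plan.
    df = spark.read.jdbc(url=host_url,
                         table="TABLENAME",
                         predicates=predicates,
                         properties={"user": username, "password": password})

    # Only when an action runs, such as this write, does Spark actually
    # fire the JDBC queries and pull the rows.
    df.write.parquet("s3://my-bucket/tablename/")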

    Spark is not optimized for reading bulk data from an RDBMS, as it establishes only a single connection to the database. If you want to stick with Spark for the read, try increasing the fetchsize property to 100000 (the default is 1000).
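
    For example, the fetchsize can be passed through the same JDBC properties (a sketch, the value is only indicative):

    df = spark.read.jdbc(url=host_url,
                         table="TABLENAME",
                         predicates=predicates,
                         properties={"user": username,
                                     "password": password,
                                     "fetchsize": "100000"})  # default is 1000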

    For parallel processing of the data you can try to leverage Python multiprocessing and execute the reads and writes in parallel (a rough sketch follows below):

    Thread 1
    Read 1 -> Write 1
    Thread 2
    Read 2 -> Write 2
    

    But first try to execute it sequentially:

    Read 1 -> Write 1 -> Read 2 -> Write 2
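
    If you do try the parallel route, a rough sketch using a thread pool from the multiprocessing package could look like the following; the chunking, pool size, helper name and S3 path are illustrative, and the threads simply submit independent Spark jobs from the driver:

    from multiprocessing.pool import ThreadPool

    def read_and_write(chunk_predicates, part_id):
        # Each task reads one slice of the table and writes it to its
        # own S3 prefix, so the Spark jobs can run concurrently.
        part_df = spark.read.jdbc(url=host_url,
                                  table="TABLENAME",
                                  predicates=chunk_predicates,
                                  properties={"user": username,
                                              "password": password,
                                              "fetchsize": "100000"})
        part_df.write.mode("overwrite").parquet(
            "s3://my-bucket/tablename/part={}".format(part_id))

    # Split the full predicates list into a few independent chunks.
    chunks = [predicates[0:3], predicates[3:6], predicates[6:9]]

    pool = ThreadPool(3)  # threads share the SparkSession on the driver
    pool.starmap(read_and_write, [(c, i) for i, c in enumerate(chunks)])
    pool.close()
    pool.join()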
    

    Another approach I would suggest is to get all your data into S3 in one go using DMS or SCT.

    DMS can dump data to S3 in parquet format and will be very fast, since it is optimized for migration tasks.

    If you don't want to use DMS, you can write a Sqoop import job that can be triggered through a transient EMR cluster. Sqoop is also capable of importing data in parquet format.

    Glue is best for transforming data that is already in S3; for migrating large volumes of data you should take the help of other services.