I have a very large dataset in a relational database (almost 70 TB uncompressed) that needs to be loaded into S3, converted to Parquet, and partitioned by year-month, col1 and col2.
This will be a daily job, and I have a 70-node cluster with 256 GB RAM and 64 vCores per node. We are using Spark to dump the data through a proprietary connector, and the dump itself is very fast: it writes to a temporary location in S3 as CSV in multiple chunks, roughly 1 million chunks of 64 MB CSV files.
Without partitioning, the conversion to Parquet completes in about 3 hours, including the unload of the data.
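For context, the non-partitioned conversion is essentially just a read of the CSV dump followed by a plain Parquet write. A minimal PySpark sketch, with the S3 paths as placeholders rather than the exact job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the ~1M chunks of 64 MB CSV files the connector dumped into the temp location.
df = spark.read.option("header", "true").csv("s3://my-bucket/tmp/csv-dump/")

# Straight conversion to Parquet with no partitioning; this is the variant
# that finishes in ~3 hours end to end.
df.write.mode("overwrite").parquet("s3://my-bucket/parquet/unpartitioned/")
```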
Problem Statement:
My data is highly skewed with respect to the required partitioning: about 70% of the data falls in recent years, and within that, col1 and col2 are also heavily skewed.
When I convert the files to Parquet without partitioning, I get thousands of small files, and multiple tasks fail with S3 SlowDown (request rate) errors. If I try to coalesce or repartition the data, I run into shuffle/out-of-memory issues. I am also trying to avoid reading the data in multiple passes: since the dumped data is un-partitioned, I could end up scanning all 1 million files every time just to filter.
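The partitioned write I am trying to get working looks roughly like the sketch below. The source date column (event_date), the S3 paths and the repartition call are illustrative placeholders, not the exact job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-partitioned-parquet").getOrCreate()

df = spark.read.option("header", "true").csv("s3://my-bucket/tmp/csv-dump/")

# Derive the year-month partition column from a date/timestamp column
# (event_date is a placeholder name).
df = df.withColumn("year_month", F.date_format("event_date", "yyyy-MM"))

# Repartitioning/coalescing on the heavily skewed partition columns before
# the write is where the shuffle blows up executor memory.
(df.repartition("year_month", "col1", "col2")
   .write.mode("overwrite")
   .partitionBy("year_month", "col1", "col2")
   .parquet("s3://my-bucket/parquet/partitioned/"))
```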
Is there a way to repartition (merge) the files after partitioning?
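To clarify what I mean by merging after partitioning: something like a second compaction pass over each already-written partition directory, as sketched below. The paths, partition values and target file count are made up, and I don't know whether this is practical at this scale:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-partition").getOrCreate()

base = "s3://my-bucket/parquet/partitioned"
# One concrete leaf partition directory (values are made up for illustration).
partition_dir = f"{base}/year_month=2023-01/col1=A/col2=B"

# Re-read that partition and rewrite it as a handful of larger files; the
# compacted output would then replace the original directory.
part_df = spark.read.parquet(partition_dir)
part_df.coalesce(8).write.mode("overwrite").parquet(partition_dir + "_compacted")
```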
Update: Thank you for all the comments. I was able to resolve the issue and achieve the requirement.
The entire process now completes in 4-5 hours.