I have a very large dataset in a relational database (almost 70 TB uncompressed) that needs to be loaded into S3, converted to Parquet, and partitioned by year-month, col1 and col2.
This will be a daily job, and I have a 70-node cluster with 256 GB RAM and 64 vCores per node. We are using Spark to dump the data through a proprietary connector, and the dump itself is very fast: it writes to a temporary location in S3 as CSV in multiple chunks, roughly 1 million chunks of 64 MB CSV files.
Without partitioning, the conversion to Parquet completes in about 3 hours, including the unload of the data.
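For context, the non-partitioned conversion is essentially just a read of the CSV dump followed by a plain Parquet write. A minimal PySpark sketch, with the S3 paths as placeholders rather than the exact job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the ~1M chunks of 64 MB CSV files the connector dumped into the temp location.
df = spark.read.option("header", "true").csv("s3://my-bucket/tmp/csv-dump/")

# Straight conversion to Parquet with no partitioning; this is the variant
# that finishes in ~3 hours end to end.
df.write.mode("overwrite").parquet("s3://my-bucket/parquet/unpartitioned/")
```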
Problem Statement:
My data is highly skewed with respect to the required partitioning: about 70% of the data falls in recent years, and within that, col1 and col2 are also heavily skewed.
When I convert the files to Parquet without partitioning, I get thousands of small files, and multiple tasks fail with S3 SlowDown (request rate) errors. If I try to coalesce or repartition the data, I run into shuffle/out-of-memory issues. I am also trying to avoid reading the data in multiple passes: since the dumped data is un-partitioned, I could end up scanning all 1 million files every time just to filter.
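The partitioned write I am trying to get working looks roughly like the sketch below. The source date column (event_date), the S3 paths and the repartition call are illustrative placeholders, not the exact job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-partitioned-parquet").getOrCreate()

df = spark.read.option("header", "true").csv("s3://my-bucket/tmp/csv-dump/")

# Derive the year-month partition column from a date/timestamp column
# (event_date is a placeholder name).
df = df.withColumn("year_month", F.date_format("event_date", "yyyy-MM"))

# Repartitioning/coalescing on the heavily skewed partition columns before
# the write is where the shuffle blows up executor memory.
(df.repartition("year_month", "col1", "col2")
   .write.mode("overwrite")
   .partitionBy("year_month", "col1", "col2")
   .parquet("s3://my-bucket/parquet/partitioned/"))
```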
Is there a way to repartition (merge) the files after partitioning?
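To clarify what I mean by merging after partitioning: something like a second compaction pass over each already-written partition directory, as sketched below. The paths, partition values and target file count are made up, and I don't know whether this is practical at this scale:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-partition").getOrCreate()

base = "s3://my-bucket/parquet/partitioned"
# One concrete leaf partition directory (values are made up for illustration).
partition_dir = f"{base}/year_month=2023-01/col1=A/col2=B"

# Re-read that partition and rewrite it as a handful of larger files; the
# compacted output would then replace the original directory.
part_df = spark.read.parquet(partition_dir)
part_df.coalesce(8).write.mode("overwrite").parquet(partition_dir + "_compacted")
```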
Update: Thank you for all the comments. I was able to resolve the issue and achieve the requirement.
The entire process now completes in 4-5 hours.