I'm sorta new to Spark. I'm currently seeing some weirdly slow parquet writes to Amazon S3 after my Spark calc finishes.
It took 1.8 hours to write a small file (2 partitions when writing). I ran the same Spark calc on a different, LARGER file (more rows + more columns, 3 partitions when writing) and that write finished in under a minute.
The write call itself: df.write.mode("overwrite").parquet(key)
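For context, here's a simplified sketch of what the job does around that write. The input/output paths, `df`, and the filter are placeholders, not my actual calc; the getNumPartitions line is just one way to confirm the 2-vs-3 partition counts in code rather than in the UI:

```python
# Minimal sketch of the write path (placeholder names/paths, not my exact job)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write-debug").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/input/")  # placeholder input
df = df.filter("some_col IS NOT NULL")             # stand-in for the actual calc

# Confirm how many partitions (and therefore write tasks/files) the write will produce
print("partitions before write:", df.rdd.getNumPartitions())

key = "s3a://my-bucket/output/result/"             # placeholder output path
df.write.mode("overwrite").parquet(key)
```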
I looked at the SQL plans and they don't look any different. Even if the slowness came from differences between the files, I wouldn't expect one write to take <1 min and the other >1.5 hours.
For my slow file, I took out the parquet write and the total calc time went from 2.6 hrs to 1 hr, so I didn't think it was lazy eval pushing the whole calc into the write step that caused the slowdown.
Do you guys have suggestions on what to investigate? I checked the DAG and the SQL tab of the history server and nothing stands out. The # of executors was the same. The main difference I see is that the bigger, faster file had 3 tasks when writing parquet, but each of those tasks processed more rows and bytes than the tasks for the smaller, slower file.
EDIT: I believe the issue is that I didn't realize it was doing lazy eval, so the parquet write was actually triggering the full computation, not just the write itself.
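In case it helps anyone else hitting this: a minimal sketch of how to separate the compute time from the write time, by forcing the calc to materialize before the write. This assumes a DataFrame `df` and an output path `key` like above; the timing prints are just for illustration:

```python
import time

# Cache the result so the lazy calc runs once and is reused by the write,
# instead of being re-executed inside the write stage.
df = df.cache()

t0 = time.time()
df.count()  # action that triggers the full computation
print(f"compute took {time.time() - t0:.1f}s")

t0 = time.time()
df.write.mode("overwrite").parquet(key)  # now mostly serialization + S3 upload
print(f"write took {time.time() - t0:.1f}s")

df.unpersist()
```

If the first timing is huge and the second is small, the "slow write" was really the deferred computation, which is what happened in my case.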