I have a process that reads a Hive (Parquet/Snappy) table and builds a 2 GB dataset. The process is iterative (~7K iterations), and the dataset is the same for every iteration, so I decided to cache it.
Somehow the caching task runs on only one executor, and the cache appears to live on that single executor, which leads to delays, OOMs, etc.
Is it because of Parquet? How can I make sure the cache is distributed across multiple executors?
Here is the spark config:
I tried repartitioning and adjusting the config, but no luck.
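For reference, here is a minimal sketch of the read/repartition/cache attempt (the table name and partition count are hypothetical placeholders, not my actual values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

val ds = spark.table("my_db.my_parquet_table") // hypothetical Hive table
  .repartition(200)                            // one of the partition counts I tried
ds.cache()
ds.count() // action to materialize the cache
```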
I am answering my own question, but it is an interesting finding and worth sharing, as @thebluephantom suggested.
Here is the situation: in the Spark code I was reading data from 3 Hive Parquet tables and building the dataset. In my case I was reading almost all columns from each table (approx. 502 columns), and Parquet is not ideal for that access pattern. The interesting thing was that Spark was not creating blocks (partitions) for my data and was caching the entire dataset (~2 GB) on just one executor.
Moreover, during my iterations, only one executor was doing all of the tasks.
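You can see this in the Storage tab of the Spark UI, or confirm it by checking the partition count directly; a quick sketch, assuming the dataset is bound to a variable `ds`:

```scala
// If this prints 1, the whole ~2 GB dataset sits in a single partition,
// so both the cached blocks and every downstream task land on one executor.
println(ds.rdd.getNumPartitions)
```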
Also, spark.default.parallelism and spark.sql.shuffle.partitions were not in my control: changing them had no effect. After converting the tables to Avro format, I could actually tune the partitions, shuffles, tasks per executor, etc. as needed.
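For reference, a sketch of what that tuning looked like once the tables were in Avro (the config values and table name are illustrative, not the exact ones I used):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.default.parallelism", "200")    // illustrative value
  .config("spark.sql.shuffle.partitions", "200") // illustrative value
  .enableHiveSupport()
  .getOrCreate()

// With the tables stored as Avro, the scan produced multiple partitions,
// so the cache spread across executors instead of piling onto one.
val ds = spark.table("my_db.my_avro_table").cache() // hypothetical Hive table
ds.count() // materialize the cache before the iterative loop
```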
Hope this helps! Thank you.