I have 271 small Parquet files (9KB each) under the same directory in an S3 bucket. I'm trying to understand how Spark determines the number of tasks when reading those files.
The cluster is AWS EMR 5.29 and my Spark config has `--num-executors 2` and `--executor-cores 2`.
When I run `spark.read.parquet("s3://bucket/path").rdd.getNumPartitions`
I get 9 partitions (and therefore 9 tasks). My question is: why? How does it work?
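For reference, the relevant settings can be inspected straight from spark-shell. The commented values below are what I believe to be the stock Spark 2.4 defaults, plus the parallelism my 2 executors x 2 cores give:

```scala
// Inspect the inputs the partitioning math depends on (spark-shell).
// Commented values: stock Spark 2.4 defaults / this cluster's numbers.
spark.sparkContext.defaultParallelism                        // 4  (2 executors x 2 cores)
spark.conf.get("spark.sql.files.maxPartitionBytes")          // "134217728" (128MB)
spark.conf.get("spark.sql.files.openCostInBytes")            // "4194304"   (4MB)
spark.read.parquet("s3://bucket/path").rdd.getNumPartitions  // 9
```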
I found the answer here:
maxSplitBytes = Min(defaultMaxSplitBytes (128MB, `spark.sql.files.maxPartitionBytes`),
                    Max(openCostInBytes (4MB, `spark.sql.files.openCostInBytes`),
                        totalBytes / defaultParallelism))

Here `totalBytes` is the sum of all the file sizes, each padded with `openCostInBytes`, and the result is the target split size. Spark then packs the files into partitions, closing the current partition whenever adding the next file would exceed `maxSplitBytes`; each packed file also adds `openCostInBytes` to the running size.
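To see where the 9 comes from, here is a rough sketch of that arithmetic. It is not the actual Spark code (which, as far as I can tell, lives in `FileSourceScanExec.createNonBucketedReadRDD` in Spark 2.4), just a re-derivation assuming the default values above and `defaultParallelism = 4` from my 2 executors x 2 cores:

```scala
// Plain Scala re-derivation of the numbers (paste into spark-shell or any Scala REPL).
// Assumes the defaults above: 128MB maxPartitionBytes, 4MB openCostInBytes,
// and defaultParallelism = 4 (2 executors x 2 cores).
val maxPartitionBytes  = 128L * 1024 * 1024        // spark.sql.files.maxPartitionBytes
val openCostInBytes    = 4L * 1024 * 1024          // spark.sql.files.openCostInBytes
val defaultParallelism = 4                         // 2 executors x 2 cores
val fileSizes = Seq.fill(271)(9L * 1024)           // 271 files of ~9KB each

// Every file is padded with its open cost before dividing by the core count.
val totalBytes    = fileSizes.map(_ + openCostInBytes).sum
val bytesPerCore  = totalBytes / defaultParallelism            // ~272MB
val maxSplitBytes = math.min(maxPartitionBytes,
                             math.max(openCostInBytes, bytesPerCore))  // 128MB

// Bin-packing: close the current partition when the next file would push it
// past maxSplitBytes; each packed file also adds its open cost to the running size.
var partitions  = 0
var currentSize = 0L
fileSizes.sortBy(-_).foreach { size =>
  if (currentSize + size > maxSplitBytes) {
    partitions += 1
    currentSize = 0L
  }
  currentSize += size + openCostInBytes
}
if (currentSize > 0) partitions += 1
println(s"maxSplitBytes = $maxSplitBytes, estimated partitions = $partitions")
// => maxSplitBytes = 134217728, estimated partitions = 9
```

With these numbers `maxSplitBytes` comes out at the full 128MB, each file effectively costs about 4MB + 9KB when packed, so roughly 32 files fit into one partition and ceil(271 / 32) = 9, which matches what `getNumPartitions` reports.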