Let's say I have a cluster of 4 nodes, each having 1 core. I have a 600 PB file which I want to process through Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in the cluster. If that is indeed the case, I will have 4 partitions (600 / 4), so each partition will be 150 PB in size.

But I think 150 PB is too big for a single partition, so is my thinking about how to deduce the number of partitions correct?
PS: I have just started with Apache Spark, so apologies if this is a naive question.
As you are storing your data on HDFS, it will already be partitioned into 64 MB or 128 MB blocks as per your HDFS configuration (let's assume 128 MB blocks). So 600 PB will result in 4,687,500,000 blocks of 128 MB each (600 PB / 128 MB).
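For example, here is a minimal sketch in Scala showing that the partition count of a file read from HDFS tracks the block count, not the core count. The path hdfs:///data/bigfile and the app name are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-count")   // illustrative name
      .getOrCreate()

    // textFile creates roughly one partition per HDFS block (128 MB here),
    // so the partition count mirrors the block count, not the number of cores.
    val rdd = spark.sparkContext.textFile("hdfs:///data/bigfile") // hypothetical path
    println(s"Number of partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```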
Now when you run your Spark job, each executor will read a few blocks of data (the number of blocks will be equal to the number of cores in the executor) and process them in parallel.

Basically, each core processes one partition. So the more cores you give to an executor, the more data it can process in parallel, but at the same time you will need to allocate more memory to the executor to handle the data loaded into memory.
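If you want to experiment with that trade-off, here is a hedged sketch of how the executor knobs look when set through the SparkSession builder; the specific values (3 instances, 1 core, 4g) are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-sizing-example")                // illustrative name
  .config("spark.executor.instances", "3")  // e.g. one executor per worker node
  .config("spark.executor.cores", "1")      // each core works on one partition at a time
  .config("spark.executor.memory", "4g")    // more cores per executor means more partitions in memory at once
  .getOrCreate()
```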
It is advisable to use moderately sized executors; having too many small executors causes a lot of data shuffling.
Now coming to your scenario: if you have a 4-node cluster with 1 core each, you will have at most 3 executors running on it, as 1 core will be taken by the Spark driver. So you will be able to process 3 partitions in parallel, and it will take your job 4,687,500,000 / 3 = 1,562,500,000 waves of tasks to process the whole dataset.
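As a back-of-the-envelope check of those figures (plain Scala, no Spark needed; it assumes decimal units, i.e. 1 PB = 1,000,000,000 MB):

```scala
object PartitionMath extends App {
  val fileSizeMB    = 600L * 1000L * 1000L * 1000L // 600 PB expressed in MB
  val blockSizeMB   = 128L
  val parallelTasks = 3L                           // 3 cores left after the driver

  val blocks = fileSizeMB / blockSizeMB            // 4,687,500,000 partitions
  val waves  = blocks / parallelTasks              // 1,562,500,000 waves of 3 tasks each
  println(s"partitions = $blocks, waves = $waves")
}
```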
Hope that helps!
Cheers!