Tags: apache-spark, pyspark

Why is rdd.getNumPartitions() triggering a job in Spark?


Why does rdd.getNumPartitions() trigger a job in the code below?

Please consider this code:

employee_df = spark.read.format('csv') \
    .option('header', 'true') \
    .load('/FileStore/tables/employee.csv')

print(employee_df.rdd.getNumPartitions())

Output: 1

At this stage, employee_df.rdd.getNumPartitions() did not trigger any job and simply printed the number of partitions as 1.

But if I repartition the data and run employee_df.rdd.getNumPartitions() again as follows:

employee_df = employee_df.repartition(2)
print(employee_df.rdd.getNumPartitions())

Output:

(1) Spark Jobs
Job 14 (Stages: 1/1)
Stage 21: 1/1 succeeded
2

I see that a job has been triggered. From what I have read, rdd.getNumPartitions() is not an action, so why does it trigger a job? Does it have something to do with the repartitioning?


Solution

  • The job you are seeing is not caused by getNumPartitions() but by repartition(). getNumPartitions() itself only reads metadata from the plan. However, repartition() is lazy: it adds a full shuffle to the execution plan, and that shuffle actually runs when the repartitioned DataFrame is first materialized, here at the point where you convert it to an RDD and ask for its partition count. That shuffle is the job you see in the UI.