Why is rdd.getNumPartitions() triggering a job in my code below?
Please consider this code:
employee_df = spark.read.format('csv') \
    .option('header', 'true') \
    .load('/FileStore/tables/employee.csv')
print(employee_df.rdd.getNumPartitions())
Output:
1
At this stage, employee_df.rdd.getNumPartitions() did not trigger any job and just printed the number of partitions as 1.
But if I repartition the data and run employee_df.rdd.getNumPartitions() again as follows:
employee_df = employee_df.repartition(2)
print(employee_df.rdd.getNumPartitions())
Output:
(1) Spark Jobs
Job 14 (Stages: 1/1)
Stage 21: 1/1 succeeded
2
I see that a job has been triggered. From what I have read, rdd.getNumPartitions() is not an action. So why does it trigger a job? Does it have something to do with the repartitioning?
The job you are seeing is not for getNumPartitions() but for repartition(). The repartition() method in Spark triggers a full shuffle, which is why you see the job.