python apache-spark pyspark spark-koalas

Koalas applymap moving all data to a single partition

I need to do element-wise operation on a Koalas DataFrame. I use for that the Koalas applymap method. On the execution Koalas moves all data to one partition and then applies the operation. The outcome is that the performance of the job is very poor.

>>> sdf = spark.range(0, 10**7, 1, 10).toDF('col1').withColumn('col2', F.lit('[1,2]'))

>>> kdf = ks.DataFrame(sdf)

>>> kdf_new = kdf[['col2']].applymap(eval)

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

How to force Koalas not to shuffle data and apply the operation in existing partitions?

Solution

You can fix this by specifying an index on the Koalas DataFrame. The default index is expected to give poor performance. Read about default index type in Koalas.

Either specify a different default index

ks.options.compute.default_index_type = 'distributed-sequence'

or specify a specific index (i.e. don't use default) on your DataFrame

kdf = kdf.set_index('col1')