data-analysis, spark-koalas

How to speed up head function execution time in Koalas?


For large datasets, the koalas.head(n) function takes a really long time. I understand that it tries to bring all the data back to the driver node and then presents the absolute top n rows.

Is there a quick way to analyse the top n rows in Koalas such that only a single partition, or a few partitions, are involved in producing the result? I do not necessarily need to see the absolute first n rows; they can be spread across different executor nodes or even reside within the same partition.


Solution

  • Adding this statement right after importing Koalas seemed to help in my case:

    import databricks.koalas as koalas  # assumed import; the alias matches the call below
    koalas.set_option('compute.default_index_type', 'distributed-sequence')
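For context, a rough sketch of how this option might be used when peeking at a large dataset. As far as I understand the Koalas docs, the default index type ('sequence') is built with a Spark window function that has no partition spec, so it can end up pulling all rows onto a single node, while 'distributed-sequence' builds the index in a distributed manner. The helper name `peek_rows` and the Parquet `path` argument are made up for illustration; the sketch assumes the data is stored as Parquet and that a Spark session is available.

```python
def peek_rows(path, n=5, index_type="distributed-sequence"):
    """Load a dataset with a distributed default index and return its first n rows.

    `peek_rows` and `path` are hypothetical names used only for this sketch.
    """
    # Imported lazily so that defining this helper does not require Koalas.
    import databricks.koalas as ks

    # option_context applies the setting only inside the `with` block, so the
    # global Koalas configuration is left untouched afterwards.
    with ks.option_context("compute.default_index_type", index_type):
        # The default index is attached when the Koalas DataFrame is created,
        # so the option has to be in effect before the data is loaded.
        kdf = ks.read_parquet(path)
        # head(n) is evaluated when the result is converted to pandas.
        return kdf.head(n).to_pandas()
```

If the order of the returned rows really does not matter, passing `index_type="distributed"` may be cheaper still, since that index type relies on Spark's monotonically_increasing_id and does not produce consecutive values. I have not benchmarked either variant, so treat this as something to try rather than a guaranteed fix.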