For large datasets, the koalas.head(n) function takes a really long time. I understand that it tries to bring all the data back to the driver node and then present the absolute top n rows.
Is there a quick way to analyse the top n rows in Koalas such that only a single partition, or a few partitions, are involved in getting the result? I do not necessarily need to see the absolute first n rows; they can be distributed randomly across different executor nodes or reside within the same partition.
Adding this statement after importing Koalas seemed to help for me:
koalas.set_option('compute.default_index_type', 'distributed-sequence')
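
As far as I understand the Koalas docs, the default index type ('sequence') builds the index with a window over a single partition, which is probably why head(n) is slow on large data, while 'distributed-sequence' builds the same 0..N-1 index in a distributed way. The pure-Python toy below is only a sketch of that idea, not Koalas's actual implementation: plain lists stand in for partitions, and the function names distributed_sequence_index and head are made up for illustration.

```python
from itertools import accumulate


def distributed_sequence_index(partitions):
    """Assign a global 0..N-1 index to rows held in separate partitions."""
    # Step 1: each partition reports only its row count (cheap, parallelisable).
    counts = [len(part) for part in partitions]
    # Step 2: a running total gives each partition its starting offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows locally from its offset,
    # so no rows have to be moved into a single partition.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


def head(indexed_partitions, n):
    """Take the first n rows, touching only as many partitions as needed."""
    rows, touched = [], 0
    for part in indexed_partitions:
        if len(rows) >= n:
            break
        touched += 1
        rows.extend(part[: n - len(rows)])
    return rows, touched


# Hypothetical data: three "partitions" of different sizes.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
indexed = distributed_sequence_index(partitions)
rows, touched = head(indexed, 2)
print(rows, touched)
```

In this toy model the first two rows come from the first partition alone, which is the behaviour the question asks for. Whether real Koalas touches only one partition depends on the Spark plan, so treat this as intuition rather than a guarantee.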