apache-spark, pyspark, apache-spark-sql

Spark 'limit' does not run in parallel?


I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an Exchange SinglePartition operation, and indeed at this stage only one task is running in the cluster.
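For illustration, a minimal sketch (hypothetical DataFrames standing in for my real tables) that reproduces the plan I see:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrames in place of my actual tables.
large = spark.range(0, 10_000_000).withColumnRenamed("id", "key")
small = spark.range(0, 1_000_000).withColumnRenamed("id", "key")

# Limit one side before the join, then inspect the plan:
# the physical plan shows an Exchange SinglePartition feeding
# a GlobalLimit, i.e. a single-task stage.
joined = large.limit(100_000).join(small, on="key")
joined.explain()
```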

This of course affects performance dramatically (removing the limit removes the single-task bottleneck, but lengthens the join as it then works on a much larger dataset).

Is limit truly not parallelizable? And if so, is there a workaround for this?

I am using Spark on a Databricks cluster.

Edit: regarding the possible duplicate: the answer there does not explain why everything is shuffled into a single partition. Also, I asked for advice on how to work around this issue.


Solution

  • Following the advice given by user8371915 in the comments, I used sample instead of limit, and it uncorked the bottleneck.

    A small but important detail: I still had to put a predictable size constraint on the result set after sample, because sample takes a fraction, so the size of the result set can vary greatly depending on the size of the input.

    Fortunately for me, running the same query with count() was very fast. So I first counted the size of the entire result set and used it to compute the fraction I later passed to sample; a sketch of the whole workaround follows below.
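A minimal sketch of the count-then-sample approach (the DataFrame and the desired_rows cap are illustrative stand-ins, not my actual job):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real join input; desired_rows is
# the hypothetical size cap I want on the result set.
df = spark.range(0, 10_000_000).toDF("key")
desired_rows = 100_000

# Counting the full result set happened to be fast for my query.
total = df.count()

# sample() takes a fraction, so compute one from the count; pad it
# slightly so the limit below is usually satisfied.
fraction = min(1.0, 1.1 * desired_rows / total)

# sample() runs in parallel; the trailing limit only has to funnel
# the already-small sampled set into a single partition, so it is
# no longer a bottleneck.
result = df.sample(withReplacement=False, fraction=fraction).limit(desired_rows)
```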