
How can I avoid a collectAsList() when path is read from another data source?


I have the parquet paths stored as a column in another table, and need to pass the values of that column as the input paths to spark.read.parquet.

List<Row> rows = spark.read().<datasource>.select("path").collectAsList();

List<String> paths = <convert the rows to strings>;

spark.read().parquet(paths);

collectAsList() is an expensive operation, since it brings the data to the driver.

Is there a better approach?


Solution

  • Is there a better approach?

    No, there is no alternative way.

    The code represented by spark.read.parquet will always be executed on the driver. The driver tells the individual executors which part of which parquet file they should load, and the executors then load the data. But coordinating which executor handles which part of a parquet file is the driver's job, so the paths have to be available on the driver.

    After the bad news, here is the good part: it is true that collectAsList is expensive, but it is not that expensive. collectAsList hurts when dealing with huge dataframes, and huge in this context means hundreds of millions of rows. I doubt that you are planning to load that many parquet files. As long as the list of paths contains "only" a few tens of thousands of rows, there is nothing wrong with sending this list to the driver. A standard JVM running the driver will easily handle such a list.
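    Putting the two snippets from the question together, a minimal sketch might look like the following. The table name `path_table` and column name `path` are assumptions standing in for your actual data source; the rest uses only standard Spark Java API calls (`collectAsList`, `Row.getString`, and the varargs overload of `DataFrameReader.parquet`).

    ```java
    import java.util.List;
    import java.util.stream.Collectors;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadParquetFromPathColumn {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("read-parquet-from-path-column")
                    .getOrCreate();

            // Hypothetical metadata table with a string column "path"
            // holding one parquet location per row.
            List<String> paths = spark.read().table("path_table")
                    .select("path")
                    .distinct()            // avoid reading the same file twice
                    .collectAsList()       // small driver-side list; fine for
                                           // up to tens of thousands of rows
                    .stream()
                    .map(row -> row.getString(0))
                    .collect(Collectors.toList());

            // parquet(String... paths) accepts all locations in one call,
            // so Spark plans a single read across every file.
            Dataset<Row> data = spark.read().parquet(paths.toArray(new String[0]));
            data.show();
        }
    }
    ```

    Passing all paths to a single `parquet(...)` call is preferable to reading each path in a loop and unioning, since Spark can then schedule the whole scan as one job.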