Tags: r, apache-spark, dplyr, tidyverse, sparklyr

sparklyr sdf_collect and dplyr collect functions on large tables in Spark take ages to run?


I am running RStudio and R 3.5.2.

I have loaded around 250 parquet files from S3a using sparklyr::spark_read_parquet.

I need to collect the data from Spark (installed by sparklyr):

spark_install(version = "2.3.2", hadoop_version = "2.7")
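
For reference, a minimal sketch of the workflow being described, assuming a local Spark connection (the s3a path and the table name are placeholders):

library(sparklyr)
library(dplyr)

# Connect to the Spark installation referenced above
sc <- spark_connect(master = "local", version = "2.3.2")

# Register the parquet files from S3a as a Spark DataFrame
# ("s3a://my-bucket/data/" is a placeholder path)
tbl <- spark_read_parquet(sc, name = "my_data", path = "s3a://my-bucket/data/")

# Bring the full result back to the driver as an R data frame
local_df <- tbl %>% collect()
# or equivalently: local_df <- sdf_collect(tbl)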

But for some reason it takes ages to do the job. Sometimes the task is distributed across all CPUs and sometimes only one is working: [screenshot of CPU usage]

Please advise how you would solve the dplyr::collect or sparklyr::sdf_collect slowness issue.

Please also understand that I can't provide you with the data; with a small amount of data the collect finishes quickly.


Solution

  • That is expected behavior. dplyr::collect, sparklyr::sdf_collect and Spark's native collect all bring the entire dataset to the driver node.

    Even when that is feasible (you need at least 2-3 times more memory than the actual size of the data, depending on the scenario), it is bound to take a long time, with the driver's network interface being the most obvious bottleneck.

    In practice, if you're going to collect all the data anyway, it typically makes more sense to skip the network and platform overhead and load the data directly using native tools (given the description, that would mean downloading the files to the driver and converting them to an R-friendly format file by file).
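
    A minimal sketch of that approach (the aws.s3 and arrow packages, the bucket name, and the prefix are assumptions, not something stated in the answer): download each parquet object to the driver and read it into R file by file.

    library(aws.s3)
    library(arrow)
    library(dplyr)

    # List the parquet objects under the prefix ("my-bucket" and "data/" are placeholders)
    objects <- get_bucket_df(bucket = "my-bucket", prefix = "data/")
    keys <- objects$Key[grepl("\\.parquet$", objects$Key)]

    # Download each file to the driver, read it with arrow, then combine the pieces
    local_dir <- tempdir()
    pieces <- lapply(keys, function(k) {
      dest <- file.path(local_dir, basename(k))
      save_object(object = k, bucket = "my-bucket", file = dest)
      read_parquet(dest)
    })
    result <- bind_rows(pieces)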