Tags: r, apache-spark, sparklyr

Sparklyr cannot reference table in spark_apply


I want to use spark_apply to iterate through a number of data processes for feature generation. To do that, I need to reference tables already loaded into Spark, but I get the following error:

ERROR sparklyr: RScript (3076) terminated unexpectedly: object 'ref_table' not found

A reproducible example:

ref_table <-   sdf_along(sc, 10)
apply_table <- sdf_along(sc, 10)

spark_apply(x = apply_table, 
            f = function(x) {
              c(x, ref_table)
            })

I know I can reference libraries inside the function, but I'm not sure how to make the data available there. I am running a local Spark cluster through RStudio.


Solution

  • Unfortunately the failure is to be expected here.

    Apache Spark, and consequently the platforms built on top of it, does not support nested transformations like this one. You cannot use nested transformations, distributed objects, or the Spark context (the spark_connection in sparklyr's case) from worker code.

    For a detailed explanation, please check my answer to "Is there a reason not to use SparkContext.getOrCreate when writing a spark job?".

    Your question doesn't give enough context to determine the best course of action here, but in general there are two possible solutions:

    • As long as one of the datasets is small enough to fit in memory, use it directly in the closure as a plain R object (see the first sketch after this list).
    • Reformulate your problem as a join or a Cartesian product (Spark's crossJoin), as in the second sketch below.
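
    A minimal sketch of the first option, assuming the reference data is small enough to collect() onto the driver. The names ref_local and ref_rows, and the use of spark_apply()'s context argument to ship the collected data frame to the workers, are illustrative choices rather than anything from the original question:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    ref_table   <- sdf_along(sc, 10)
    apply_table <- sdf_along(sc, 10)

    # Pull the small reference table down to the driver as a plain R data frame
    ref_local <- collect(ref_table)

    # context serializes ref_local and passes it to f as its second argument,
    # so each worker gets its own in-memory copy alongside its partition
    spark_apply(
      x = apply_table,
      f = function(df, ref) {
        # ref is an ordinary R data frame here; combine it with df as needed
        cbind(df, ref_rows = nrow(ref))
      },
      context = ref_local
    )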
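
    And a sketch of the join reformulation, pairing every row of apply_table with every row of ref_table entirely inside Spark by joining on a constant dummy key (a plain equi-join, so no Cartesian-product settings are needed). The column names ref_id and dummy are hypothetical, and the rename assumes sdf_along() produces an id column:

    # Rename first so the two tables don't clash on column names after the join
    ref_renamed <- ref_table %>% rename(ref_id = id)

    crossed <- apply_table %>%
      mutate(dummy = 1L) %>%
      inner_join(ref_renamed %>% mutate(dummy = 1L), by = "dummy") %>%
      select(-dummy)

    # From here the per-row feature logic can run as ordinary dplyr verbs, or
    # crossed can be passed to spark_apply() without touching ref_table again.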