I want to use spark_apply to iterate through a number of data processes for feature generation. To do that I need to reference tables already loaded into Spark, but I get the following error:
ERROR sparklyr: RScript (3076) terminated unexpectedly: object 'ref_table' not found
A reproducible example:
ref_table   <- sdf_along(sc, 10)
apply_table <- sdf_along(sc, 10)

spark_apply(
  x = apply_table,
  f = function(x) {
    # referencing the Spark DataFrame ref_table inside the
    # worker function is what triggers the error
    c(x, ref_table)
  }
)
I know I can reference libraries inside the function, but I'm not sure how to call up the data. I am running a local Spark cluster through RStudio.
Unfortunately the failure is to be expected here.
Apache Spark (and, by extension, platforms built on it) doesn't support nested transformations like this one. You cannot use nested transformations, distributed objects, or the Spark context (the spark_connection, in the case of sparklyr) from worker code.
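To illustrate where the boundary lies: only plain, local R objects can cross into worker code. If ref_table is small enough to collect to the driver, a sketch like this one (using spark_apply's context argument, which serializes a local object to the workers; the merge on id is just for illustration) would work:

ref_local <- dplyr::collect(ref_table)  # now an ordinary local data.frame

spark_apply(
  x = apply_table,
  f = function(df, ctx) {
    # ctx is the local copy of ref_table, not a Spark object
    merge(df, ctx, by = "id")
  },
  context = ref_local
)

For anything that doesn't comfortably fit in driver memory, though, the options below are the right tools.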
For a detailed explanation, please check my answer to "Is there a reason not to use SparkContext.getOrCreate when writing a spark job?".
Your question doesn't give enough context to determine the best course of action here, but in general there are two possible solutions:
a join, or the Cartesian product (Spark's crossJoin).
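A minimal sketch of both, using sparklyr's dplyr interface and the tables from the question (sdf_along gives each table a single id column; the ref_id rename is only an assumption, to avoid a name collision in the cross join):

library(sparklyr)
library(dplyr)

ref_table   <- sdf_along(sc, 10) %>% rename(ref_id = id)
apply_table <- sdf_along(sc, 10)

# Option 1: an ordinary join on a key, executed entirely on the Spark side
joined <- inner_join(apply_table, ref_table, by = c("id" = "ref_id"))

# Option 2: the Cartesian product, expressed as a join on a constant
# dummy key, which Spark plans as a cross join
crossed <- apply_table %>%
  mutate(dummy = 1L) %>%
  inner_join(ref_table %>% mutate(dummy = 1L), by = "dummy") %>%
  select(-dummy)

Either result is a single Spark table, so it can be handed to spark_apply directly and the worker function never has to reach back for a second distributed object.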