I would like to use Spark's mapPartitions and reduce functions (http://spark.apache.org/docs/latest/programming-guide.html) via sparklyr.
This is easy in PySpark: all I need is plain Python code, and I can simply pass Python functions as callbacks. For example, I can use these two functions as follows:
mapdata = self.rdd.mapPartitions(lambda part: mycbfunc1(part, myparam1))
res = mapdata.reduce(lambda x, y: mycbfunc2(x, y, myparam2))
However, this does not seem to be possible in R, for example with the sparklyr library. I also checked SparkR, but it appears to be just another way of querying/wrangling data in R, nothing more.
I would appreciate it if someone could show me how to use these two functions in R, with R callback functions.
In SparkR you could use internal functions - hence the SparkR::: prefix - to accomplish the same.
# convert the SparkDataFrame to an (internal) RDD
newRdd = SparkR:::toRDD(self)
# apply the per-partition callback, closing over myparam1
mapdata = SparkR:::mapPartitions(newRdd, function(part) { mycbfunc1(part, myparam1) })
# reduce expects a binary function, hence the two arguments
res = SparkR:::reduce(mapdata, function(x, y) { mycbfunc2(x, y, myparam2) })
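For completeness, here is a minimal end-to-end sketch of what this could look like. The callback bodies, the parameter values, and the assumption that each RDD element arrives as a named list of row values are all hypothetical/illustrative; also note that these internal functions are unexported and may change between Spark versions (this sketch assumes SparkR 2.x):

library(SparkR)
sparkR.session()

# Hypothetical callbacks, defined only for illustration.
# mycbfunc1 transforms one partition (assumed to be a list of rows,
# each row a named list); mycbfunc2 combines two values.
mycbfunc1 <- function(part, myparam1) {
  lapply(part, function(row) row$value * myparam1)
}
mycbfunc2 <- function(x, y, myparam2) {
  x + y + myparam2
}

df <- createDataFrame(data.frame(value = 1:10))

newRdd <- SparkR:::toRDD(df)
mapdata <- SparkR:::mapPartitions(newRdd, function(part) mycbfunc1(part, 2))
res <- SparkR:::reduce(mapdata, function(x, y) mycbfunc2(x, y, 0))
res  # should be 110 with these toy inputs: sum(1:10 * 2)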
I believe sparklyr interfaces only with the DataFrame / Dataset API.