I've gotten to the point where I can follow along with the example here (with only the slight modification of adding config = list() to the input arguments).
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client", config = list())
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_tbl %>% filter(dep_delay == 2)
Source: query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
<int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 2 830 11 "UA" "N14228" 1545 "EWR" "IAH" 227 1400 5 17
2 2013 1 1 542 2 923 33 "AA" "N619AA" 1141 "JFK" "MIA" 160 1089 5 42
3 2013 1 1 702 2 1058 44 "B6" "N779JB" 671 "JFK" "LAX" 381 2475 7 2
4 2013 1 1 715 2 911 21 "UA" "N841UA" 544 "EWR" "ORD" 156 719 7 15
5 2013 1 1 752 2 1025 -4 "UA" "N511UA" 477 "LGA" "DEN" 249 1620 7 52
6 2013 1 1 917 2 1206 -5 "B6" "N568JB" 41 "JFK" "MCO" 145 944 9 17
7 2013 1 1 932 2 1219 -6 "VX" "N641VA" 251 "JFK" "LAS" 324 2248 9 32
8 2013 1 1 1028 2 1350 11 "UA" "N76508" 1004 "LGA" "IAH" 237 1416 10 28
9 2013 1 1 1042 2 1325 -1 "B6" "N529JB" 31 "JFK" "MCO" 142 944 10 42
10 2013 1 1 1231 2 1523 -6 "UA" "N402UA" 428 "EWR" "FLL" 156 1065 12 31
# ... with more rows
However, when I try to use other R functions, as one might with dplyr on a local data frame, things go awry:
flights_tbl %>% filter(dep_delay == 2 & grepl("A$", tailnum))
Source: query [?? x 16]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
Error: org.apache.spark.sql.AnalysisException: undefined function GREPL; line 4 pos 41
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
at scala.util.Try.getOrElse(Try.scala:77)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.
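The error message ("undefined function GREPL") suggests that dplyr leaves functions it cannot translate untouched in the generated SQL, so grepl lands in the query verbatim. Inspecting the query without executing it should confirm this; as a sketch, show_query() (or explain() on older dplyr versions) renders the SQL a tbl would send:

flights_tbl %>%
  filter(dep_delay == 2 & grepl("A$", tailnum)) %>%
  show_query()
# roughly: SELECT * FROM `flights`
#          WHERE `dep_delay` = 2.0 AND GREPL('A$', `tailnum`)
# GREPL is not a function Spark SQL knows, hence the AnalysisException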
Obviously grepl is not supported. My question is: is there a way to use base R or R package functions inside sparklyr queries? If not, is it coming? It seems that work along these lines is progressing with dapply and gapply in SparkR v2, but it would be great if it worked with sparklyr.
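For this particular case I can work around it by doing the translatable filtering in Spark first and then collecting into R, where grepl works, but pulling the data down locally defeats the purpose for large tables:

flights_tbl %>%
  filter(dep_delay == 2) %>%    # this part runs in Spark
  collect() %>%                 # pulls the result into a local tibble
  filter(grepl("A$", tailnum))  # ordinary R from here on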
Just saw this issue for sparklyr. The short answer is "not yet". Looking forward to future versions where this functionality is added.
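In the meantime, one possible workaround for regex cases like this one: since dplyr passes unrecognized functions through to Spark SQL unchanged, a Hive UDF such as regexp_replace() can emulate grepl("A$", ...). This is just a sketch, not the sparklyr API, and it assumes Hive UDFs are available on your cluster:

flights_tbl %>%
  filter(dep_delay == 2 &
         regexp_replace(tailnum, "A$", "") != tailnum)
# regexp_replace() returns the string unchanged when the pattern does not
# match, so the inequality holds exactly when tailnum ends in "A"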