Tags: r, amazon-ec2, sparklyr

Sparklyr on RStudio EC2 with invoke error hadoopConfiguration standalone cluster


So I have a 1 master/2 slave standalone cluster on EC2. I am running RStudio on EC2, and I get an error after I run the following code:

library(aws.s3)
library(sparklyr)
library(tidyverse)
library(RCurl)

Sys.setenv("AWS_ACCESS_KEY_ID" =  "myaccesskeyid",
           "AWS_SECRET_ACCESS_KEY" = "myaccesskey",
           "SPARK_CONF_DIR" = "/home/rstudio/spark/spark-2.1.0-bin-hadoop2.7/bin/",
           "JAVA_HOME" = "/usr/lib/jvm/java-8-oracle" )

# sc is assumed to have been created earlier with spark_connect(), e.g.:
# sc <- spark_connect(master = "spark://<master-ip>:7077")
ctx <- spark_context(sc)
jsc <- invoke_static(sc, 
                     "org.apache.spark.api.java.JavaSparkContext", 
                     "fromSparkContext", ctx)

hconf <- jsc %>% invoke("hadoopConfiguration")

The last line is where I encounter an error:

Error in do.call(.f, args, envir = .env) : 
  'what' must be a function or character string

From my research, I know that invoke is how sparklyr calls methods on Java objects. I checked and confirmed that Java is installed on the master and both slaves and that JAVA_HOME is set.


Solution

  • When you call library(tidyverse), you see the conflicts summary, which explains what is going on:

     ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ purrr::invoke() masks sparklyr::invoke()
    ✖ dplyr::lag()    masks stats::lag()
    

    As you can see, purrr::invoke, which has exactly the signature from the error message:

    invoke(.f, .x = NULL, ..., .env = NULL)
    

    masks sparklyr::invoke. Using the fully qualified name

    jsc %>% sparklyr::invoke("hadoopConfiguration")
    

    should resolve the problem.
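
A quick way to confirm which package wins the name lookup is to inspect the function's namespace; this short sketch only assumes both packages are installed and attached, no Spark connection needed:

    library(sparklyr)   # attaches sparklyr::invoke
    library(purrr)      # attached afterwards, so purrr::invoke masks it

    # The bare name now resolves to purrr's version:
    environmentName(environment(invoke))
    # "purrr"

    # The fully qualified name always picks the intended generic:
    environmentName(environment(sparklyr::invoke))
    # "sparklyr"

Attach order matters because the most recently attached package sits earliest on the search path. Loading sparklyr after tidyverse would therefore also make the bare invoke() resolve to sparklyr's version, but qualifying the call with sparklyr:: is the more robust fix, since it keeps working regardless of load order.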