
Using Caret with SparkR?


Perhaps somewhat similar to this question, SparkR data frames don't seem to be compatible with the caret package.

When I try to train my model, I get the following error:

    Error in as.data.frame.default(data) : 
      cannot coerce class "structure("SparkDataFrame", package = "SparkR")" to a data.frame

Is there any way around this? Below is a reproducible example using iris:

    #load libraries
    library(caret)
    library(randomForest)
    set.seed(42)

    #point R session to Spark
    Sys.setenv(SPARK_HOME = "your/spark/installation/here")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

    #load SparkR
    library(SparkR)

    #initialize Spark context
    sc <- sparkR.init(master = "local", sparkEnvir = list(spark.driver.memory = "2g"))

    #initialize SQL context
    sqlContext <- sparkRSQL.init(sc)

    train2 <- createDataFrame(sqlContext, iris)

    #train the model
    model <- train(Species ~ Sepal_Length + Petal_Length,
                   data = train2,
                   method = "rf",
                   trControl = trainControl(method = "cv", number = 5))

Again, any way around this? If not, what's the most straightforward path to machine learning with SparkR?


Solution

  • You can't use caret's training methods on SparkDataFrames, as you've discovered. You can, however, use the Spark ML algorithms that SparkR wraps; for instance, to train a random forest classifier, use SparkR::spark.randomForest:

    #train the model
    model <- spark.randomForest(train2,
                                Species ~ Sepal_Length + Petal_Length,
                                type = "classification",
                                maxDepth = 5,
                                numTrees = 100)

    summary(model)
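
Once the model is fitted, scoring can also stay inside Spark: SparkR's predict method for its ML models takes a SparkDataFrame and returns a SparkDataFrame with the predicted label appended. A minimal sketch, assuming the Spark context, train2, and model from above are still live (it simply scores the training data for illustration):

    #score a SparkDataFrame with the fitted model (training data reused here purely for illustration)
    predictions <- predict(model, train2)

    #pull the first few rows back to the driver to inspect the appended prediction column
    head(predictions)

For anything beyond a quick look, keep the result as a SparkDataFrame and only collect() small subsets or aggregates back to the driver.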