Perhaps somewhat similar to this question, SparkR DataFrames don't seem to be compatible with the caret package.
When I try to train my model, I get the following error:
Error in as.data.frame.default(data) :
cannot coerce class "structure("SparkDataFrame", package = "SparkR")" to a data.frame
Is there any way around this? Here's a reproducible example using iris:
#load libraries
library(caret)
library(randomForest)
set.seed(42)
#point R session to Spark
Sys.setenv(SPARK_HOME = "your/spark/installation/here")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
#load SparkR
library(SparkR)
#initialize Spark context
sc <- sparkR.init(master = "local", sparkEnvir = list(spark.driver.memory = "2g"))
#initialize SQL context
sqlContext <- sparkRSQL.init(sc)
train2 <- createDataFrame(sqlContext, iris)
#train the model
model <- train(Species ~ Sepal_Length + Petal_Length,
               data = train2,
               method = "rf",
               trControl = trainControl(method = "cv", number = 5))
Again, any way around this? If not, what's the most straightforward path to machine learning with SparkR?
You can't use caret's training methods on SparkDataFrames, as you've discovered. You can, however, use Spark MLlib's algorithms through SparkR; for instance, to train a random forest classifier, use SparkR::spark.randomForest:
#train the model
model <- spark.randomForest(train2, Species ~ Sepal_Length + Petal_Length,
                            type = "classification",
                            maxDepth = 5,
                            numTrees = 100)
summary(model)
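Once the model is fitted, you can score any SparkDataFrame with predict(), which returns a new SparkDataFrame with a "prediction" column appended. A minimal sketch (reusing train2 from the question purely for illustration; in practice you'd score a held-out split):
#score the data; predict() appends a "prediction" column
predictions <- predict(model, train2)
#pull a few rows back to the driver to inspect
head(collect(select(predictions, "Species", "prediction")))
You can also persist the fitted model with write.ml(model, path) and reload it in a later session with read.ml(path).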