Following the ALS example here, but running in distributed mode, e.g.:
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--master yarn sparkr-shell")
spark <- sparkR.session(master = "yarn",
                        sparkConfig = list(
                          spark.driver.memory = "2g",
                          spark.driver.extraJavaOptions =
                            paste("-Dhive.metastore.uris=",
                                  Sys.getenv("HIVE_METASTORE_URIS"),
                                  " -Dspark.executor.instances=",
                                  Sys.getenv("SPARK_EXECUTORS"),
                                  " -Dspark.executor.cores=",
                                  Sys.getenv("SPARK_CORES"),
                                  sep = "")
                        ))
ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0), list(1, 2, 4.0), list(2, 1, 1.0), list(2, 2, 5.0))
df <- createDataFrame(ratings, c("user", "item", "rating"))
model <- spark.als(df, "rating", "user", "item")
stats <- summary(model)
userFactors <- stats$userFactors
itemFactors <- stats$itemFactors
summary(model)
# make predictions
predicted <- predict(object=model, data=df)
I get the following error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "ALSModel"
Looking at the source for 2.1.1, the method seems to exist, and the summary()
function defined directly above it works just fine.
I have tried with Spark 2.1.0, 2.1.1, and 2.2.0-rc6, all of which give the same result. Also, this is not limited to the ALS model; calling predict()
for any model gives the same error.
I also get the same error when I run it in local mode, e.g.
spark <- sparkR.session("local[*]")
Has anybody come across this problem before?
Although I have not reproduced your error exactly (I get a different one), the problem is most probably in the second argument of your predict call: it should be newData, not data (see the documentation).
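You can catch this kind of mistake without a Spark session at all by inspecting a function's formal argument names with base R's formals(). Here is a minimal sketch using toy_predict, a hypothetical stand-in that mimics a predict(object, newData) signature:

```r
# Plain base R, no Spark required. toy_predict is a made-up function
# whose signature mirrors predict(object, newData).
toy_predict <- function(object, newData) {
  nrow(newData)  # stand-in "prediction": just count the rows
}

# List the expected argument names before calling the function:
names(formals(toy_predict))
# [1] "object"  "newData"
```

Passing data= instead of newData= would not match any named formal argument, which is exactly the kind of mismatch worth checking for here.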
Here is an adaptation of your code for Spark 2.2.0 run locally from RStudio:
library(SparkR, lib.loc = "/home/ctsats/spark-2.2.0-bin-hadoop2.7/R/lib") # change the path accordingly here
sparkR.session(sparkHome = "/home/ctsats/spark-2.2.0-bin-hadoop2.7") # and here
ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0), list(1, 2, 4.0), list(2, 1, 1.0), list(2, 2, 5.0))
df <- createDataFrame(ratings, c("user", "item", "rating"))
model <- spark.als(df, "rating", "user", "item")
stats <- summary(model)
userFactors <- stats$userFactors
itemFactors <- stats$itemFactors
summary(model)
# make predictions
predicted <- predict(object = model, newData = df)  # newData here
showDF(predicted)
# +----+----+------+----------+
# |user|item|rating|prediction|
# +----+----+------+----------+
# | 1.0| 1.0| 3.0| 2.810426|
# | 2.0| 1.0| 1.0| 1.0784092|
# | 0.0| 1.0| 2.0| 1.997412|
# | 1.0| 2.0| 4.0| 3.9731808|
# | 2.0| 2.0| 5.0| 4.8602753|
# | 0.0| 0.0| 4.0| 3.8844662|
# +----+----+------+----------+
A simple predict(model, df) will also work.