I've been trying to follow along with this blog post:
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
Using Spark 2.1 with built-in Hadoop 2.7, running locally, I can save a model:
trainedModel.save("mymodel.model")
However, if I try to load the model from a regular Scala (sbt) shell, HDFS fails to load.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.{PipelineModel, Predictor}

// Plain local SparkContext, created without spark-submit or spark-shell
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("myApp"))

// Fails here with the error below
val model = PipelineModel.load("mymodel.model")
I get this error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated
Is it in fact possible to use a Spark model without calling spark-submit or spark-shell? The article I linked to was the only one I'd seen mentioning such functionality.
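For concreteness, this is the kind of standalone entry point I have in mind, as a sketch (the object name is just a placeholder; it's launched with sbt run rather than spark-submit):

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.PipelineModel

// Hypothetical standalone app, run with `sbt run` instead of spark-submit
object LoadModelApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("myApp"))
    val model = PipelineModel.load("mymodel.model")
    println(s"Loaded pipeline with ${model.stages.length} stages")
    sc.stop()
  }
}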
My build.sbt is using the following dependencies:
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" % "spark-sql_2.11" % "2.1.0",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.1.0",
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"
In both cases I am using Scala 2.11.8.
Edit: Okay, it looks like including this was the source of the problem:
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"
Once I removed that line, the problem went away.
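For reference, the dependency block without that line looks roughly like this (the Spark artifacts already pull in the Hadoop client classes they need transitively):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.spark" % "spark-hive_2.11" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
)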
Try:
trainedModel.write.overwrite().save("mymodel.model")
Also, if your model is saved locally, you can remove HDFS from your configuration (the hadoop-hdfs dependency and any HDFS settings). This should prevent Spark from attempting to instantiate the HDFS FileSystem provider.
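One concrete way to keep everything on the local filesystem, as a sketch (the fs.defaultFS setting here is my own suggestion and untested against this exact setup):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Force the default Hadoop filesystem to the local one so the HDFS
// DistributedFileSystem provider is never instantiated
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("myApp")
  .config("spark.hadoop.fs.defaultFS", "file:///")
  .getOrCreate()

val model = PipelineModel.load("mymodel.model")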