Tags: scala, apache-spark, hdfs, apache-spark-ml

Using Spark ML models outside of Spark [hdfs DistributedFileSystem could not be instantiated]


I've been trying to follow along with this blog post:

https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/

Using Spark 2.1 with built-in Hadoop 2.7, running locally, I can save a model:

trainedModel.save("mymodel.model")

However, if I try to load the model from a regular Scala (sbt) shell, HDFS fails to load.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.{PipelineModel, Predictor}

val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("myApp"))

val model = PipelineModel.load("mymodel.model")

I get this error:

java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated

Is it in fact possible to use a Spark model without calling spark-submit or spark-shell? The article I linked to is the only one I've seen mentioning such functionality.

My build.sbt is using the following dependencies:

"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" % "spark-sql_2.11" % "2.1.0",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.1.0",
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"

In both cases I am using Scala 2.11.8.

Edit: Okay, it looks like including this was the source of the problem:

"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"

I removed that line and the problem went away.
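
For reference, the dependency list with that line dropped looks like the following. This is a sketch that assumes the remaining coordinates stay exactly as listed above and sit in the usual libraryDependencies ++= Seq(...) block; presumably the explicit hadoop-hdfs artifact was conflicting with the Hadoop client classes the Spark artifacts already pull in transitively:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.spark" % "spark-hive_2.11" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
  // "org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"  // removed: this was the source of the error
)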


Solution

  • try:

    trainedModel.write.overwrite().save("mymodel.model")
    

    Also, if your model is saved locally, you can remove hdfs from your configuration. This should prevent Spark from attempting to instantiate HDFS.
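
    Putting this together, a minimal sketch of loading the locally saved model from a plain Scala (sbt) program, without spark-submit or spark-shell, could look like the following. It assumes the mymodel.model directory from the question, and the input DataFrame (with hypothetical columns "feature1" and "feature2") is a stand-in for whatever columns your pipeline was actually trained on:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.PipelineModel

    object Predict {
      def main(args: Array[String]): Unit = {
        // Local SparkSession; no cluster or spark-submit required
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("myApp")
          .getOrCreate()
        import spark.implicits._

        // Load the pipeline saved with trainedModel.write.overwrite().save("mymodel.model")
        val model = PipelineModel.load("mymodel.model")

        // Hypothetical input row; the column names and types must match the trained pipeline
        val input = Seq((0.0, 1.0)).toDF("feature1", "feature2")

        // transform appends the prediction column(s) produced by the pipeline
        model.transform(input).show()

        spark.stop()
      }
    }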