Tags: scala, hdfs, smile

Smile - Model Persistence - How to write models to HDFS?


I am trying to use Smile in my Scala project which uses Spark and HDFS. For reusability of my models, I need to write them to HDFS.

Right now I am using the write object, checking beforehand whether the path exists and creating it if it does not (otherwise it would throw a FileNotFoundException):

import java.io.File
import java.nio.file.{Path, Paths}

val path: String = "hdfs:/my/hdfs/path"
val outputPath: Path = Paths.get(path)
val outputFile: File = outputPath.toFile
if (!outputFile.exists()) {
  outputFile.getParentFile.mkdirs()  // no-op if the directory already exists
  outputFile.createNewFile()
}
write(mySmileModel, path)

but this creates the path "hdfs:/my/hdfs/path" on the local filesystem and writes the model there, instead of actually writing to HDFS.
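The local write can be explained by how java.nio parses the string: it knows nothing about the "hdfs" scheme, so the whole string becomes a relative path on the default (local) filesystem. A minimal sketch illustrating this, using the same path as above:

```scala
import java.nio.file.Paths

// "hdfs:" is not interpreted as a URI scheme here; it is simply the
// name of the first directory component of a relative local path.
val p = Paths.get("hdfs:/my/hdfs/path")
println(p.isAbsolute)          // false: a relative local path
println(p.getName(0))          // hdfs:
```

So `outputFile.createNewFile()` ends up creating a local directory tree literally named `hdfs:`.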
Note that using a Spark model and its save method works:

mySparkModel.save("hdfs:/my/hdfs/path")

Hence my question: how do I write a Smile model to HDFS?
And, once I can write a model to HDFS, how do I read it back?

Thanks!


Solution

  • In the end, I solved my problem by writing my own save method for my wrapper class, which roughly amounts to:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, FileSystem, Path}
    import java.io.{ObjectInputStream, ObjectOutputStream}
    import java.net.URI
    
    val path: String = "hdfs:/my/hdfs/path"
    val file: Path = new Path(path)
    val conf: Configuration = new Configuration()
    val hdfs: FileSystem = FileSystem.get(new URI(path), conf)
    val outputStream: FSDataOutputStream = hdfs.create(file)
    val objectOutputStream: ObjectOutputStream = new ObjectOutputStream(outputStream)
    objectOutputStream.writeObject(model)
    objectOutputStream.close()  // also closes the underlying FSDataOutputStream
    

    Similarly, for loading the saved model I wrote a method doing roughly the following:

    val conf: Configuration = new Configuration()
    val path: String = "hdfs:/my/hdfs/path"
    val hdfs: FileSystem = FileSystem.get(new URI(path), conf)
    val inputStream: FSDataInputStream = hdfs.open(new Path(path))
    val objectInputStream: ObjectInputStream = new ObjectInputStream(inputStream)
    val model: RandomForest = objectInputStream.readObject().asInstanceOf[RandomForest]
    objectInputStream.close()  // also closes the underlying FSDataInputStream
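Since ObjectOutputStream/ObjectInputStream work over any stream, the round-trip above can be sanity-checked without a cluster by swapping the HDFS streams for in-memory ones. A sketch under that assumption (the Map here is just a serializable stand-in; a real Smile model would be serialized the same way):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for a Smile model: any java.io.Serializable object works.
val model: Map[String, Double] = Map("trees" -> 100.0, "accuracy" -> 0.93)

// Write: same pattern as with hdfs.create(file), just into a byte buffer.
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(model)
out.close()

// Read back: mirrors hdfs.open(new Path(path)).
val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val restored = in.readObject().asInstanceOf[Map[String, Double]]
in.close()

println(restored == model)  // true: the round-trip preserves the object
```

The only HDFS-specific parts of the solution are how the streams are obtained; the serialization logic itself is plain Java object serialization.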