apache-spark machine-learning apache-spark-mllib apache-spark-ml

Word2Vec model is saved as a single Parquet part file


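import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
// Train the model, then save it under the same name it was assigned to
val model: org.apache.spark.ml.feature.Word2VecModel = new Word2Vec().setNumPartitions(20).setInputCol("value").setOutputCol("feature").fit(copus)
model.save(s"$HDFS_URL/w2vmodel")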

When I save this model, it creates only a single part file under the data folder, part-r-0000-988jdu-sduj76-jh433.snappy.parquet, with a size of 900 MB.

val model: org.apache.spark.ml.feature.Word2VecModel = Word2VecModel.load(s"$HDFS_URL/w2vmodel")

When I load this model, I get an OutOfMemory exception.

Is there any way to save this model as multiple Parquet part files, or in some other format?

I am a newbie, so any suggestions would be appreciated.


Solution

  • Coincidentally, this problem was recently discussed on the developers list, and that discussion resulted in a JIRA ticket and a pull request.

    If you want a quick solution, you can try the RDD-based MLlib implementation (org.apache.spark.mllib.feature.Word2Vec) with Spark 2.0 or later (SPARK-11994).
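
    A minimal sketch of the MLlib route follows, assuming Spark 2.0 or later on the classpath. The toy corpus, the local master, the temp-directory path, and the object and method names are placeholders that are not from the question; in the question's setup the input would instead be an RDD built from the "value" column of copus, and the path would be s"$HDFS_URL/w2vmodel". Whether the saved model is actually split into several part files depends on its size, so treat this as something to try rather than a guaranteed fix.

```scala
import java.nio.file.Files
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MllibWord2VecSketch {
  // Trains a tiny MLlib model, saves it, loads it back,
  // and returns the number of word vectors in the loaded model.
  def roundTrip(): Int = {
    // Assumption: local master for illustration; on a cluster this comes from spark-submit.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("w2v-mllib-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // Toy corpus standing in for the question's `copus`: one tokenized sentence per element.
      val sentences: RDD[Seq[String]] = sc.parallelize(Seq(
        "spark stores the model as parquet".split(" ").toSeq,
        "the model holds one vector per word".split(" ").toSeq
      ))

      // minCount is lowered to 1 only because the toy corpus is tiny.
      val model: Word2VecModel = new Word2Vec()
        .setNumPartitions(2)
        .setVectorSize(10)
        .setMinCount(1)
        .fit(sentences)

      // The target path must not exist yet; a fresh temp directory is used here.
      val path = Files.createTempDirectory("w2v").resolve("w2vmodel").toString
      model.save(sc, path)

      val loaded = Word2VecModel.load(sc, path)
      loaded.getVectors.size
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"vectors loaded: ${roundTrip()}")
}
```

    Note that the MLlib model is a different class from org.apache.spark.ml.feature.Word2VecModel and works on RDDs rather than DataFrame columns, so any code that calls transform on the ML model would need to be adapted.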