apache-spark machine-learning apache-spark-mllib apache-spark-ml

Word2Vec model is saved as a single Parquet part file

val model: org.apache.spark.ml.feature.Word2VecModel = new Word2Vec().setNumPartitions(20).setInputCol("value").setOutputCol("feature").fit(corpus)
model.save(s"$HDFS_URL/w2vmodel")

When I save this model, it creates only a single part file under the data folder, part-r-0000-988jdu-sduj76-jh433.snappy.parquet, with a size of 900 MB.

val model: org.apache.spark.ml.feature.Word2VecModel = Word2VecModel.load(s"$HDFS_URL/w2vmodel")

When I load this model, I get an OutOfMemory exception.

Is there any way to save this model as multiple Parquet part files, or in some other format?

I am a newbie, so any suggestion will be appreciated.

Solution

Coincidentally, this problem was recently discussed on the developers list, and the discussion resulted in a JIRA ticket and a pull request:

SPARK-19247 - improve ml word2vec save/load
https://github.com/apache/spark/pull/16607

If you want a quick solution, you can try the MLlib implementation (org.apache.spark.mllib.feature.Word2Vec) with Spark 2.0 or later (SPARK-11994).
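
As a rough illustration of the MLlib route, here is a minimal sketch. It assumes the "value" column of the question's DataFrame holds arrays of tokens; the object name MllibWord2VecSketch and the helpers toTokenRdd, trainAndSave and loadModel are hypothetical names made up for this example, not part of Spark. Note that the result is an org.apache.spark.mllib.feature.Word2VecModel, which is a different class from the ml one and cannot be dropped into an ML Pipeline as is.

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object MllibWord2VecSketch {

  // Assumption: the "value" column contains arrays of tokens (Array[String]).
  def toTokenRdd(corpus: DataFrame): RDD[Seq[String]] =
    corpus.select("value").rdd.map(row => row.getSeq[String](0))

  // Train with the RDD-based MLlib Word2Vec and save the model to `path`.
  def trainAndSave(spark: SparkSession,
                   docs: RDD[Seq[String]],
                   path: String,
                   numPartitions: Int = 20): Unit = {
    val model: Word2VecModel = new Word2Vec()
      .setNumPartitions(numPartitions)
      .fit(docs)
    // The MLlib save/load API takes the SparkContext explicitly.
    model.save(spark.sparkContext, path)
  }

  // Load a model previously written by trainAndSave.
  def loadModel(spark: SparkSession, path: String): Word2VecModel =
    Word2VecModel.load(spark.sparkContext, path)
}
```

With the names from the question, usage would look something like MllibWord2VecSketch.trainAndSave(spark, MllibWord2VecSketch.toTokenRdd(corpus), s"$HDFS_URL/w2vmodel"), followed by MllibWord2VecSketch.loadModel(spark, s"$HDFS_URL/w2vmodel"). Whether this avoids the OutOfMemory error in your setup still depends on the memory available, since the loaded MLlib model keeps all word vectors in memory.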