I am trying to load Google's pre-trained word2vec vectors ('GoogleNews-vectors-negative300.bin.gz') into Spark.
I converted the bin file to txt and created a smaller chunk for testing, which I called 'vectors.txt'. I tried loading it as follows:
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local[*]")
  .appName("Word2VecExample")
  .getOrCreate()
// load expects a model directory written by Word2VecModel.save
val model2 = Word2VecModel.load(sparkSession.sparkContext, "src/main/resources/vectors.txt")
val synonyms = model2.findSynonyms("the", 5)
for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
To my surprise, I get the following error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/elievex/Repository/ARCANA/src/main/resources/vectors.txt/metadata
I'm not sure where the 'metadata' after 'vectors.txt' came from. I am using Spark, Scala and Scala IDE for Eclipse.
What am I doing wrong? Is there a different way to load a pre-trained model in Spark? I would appreciate any tips.
How exactly did you get vectors.txt? If you read the JavaDoc for Word2VecModel.save, you will see that:
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
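So what you need is a model in Parquet format, which is the standard for Spark ML models. That is also where the 'metadata' in your error comes from: load treats the path as a model directory and looks for path/metadata inside it.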
Unfortunately, loading from Google's native format has not been implemented yet (see SPARK-9484).
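As a possible workaround, here is a minimal sketch that parses the converted text file yourself and builds the model from a Map. It assumes each line of vectors.txt has the form "word v1 v2 ... v300" (optionally preceded by a "vocabSize dim" header line), and it relies on the public Word2VecModel(Map[String, Array[Float]]) constructor, which as far as I know has been in MLlib since Spark 1.5. The object name LoadTextVectors, the helper parseLine and the output path target/word2vec-model are made up for illustration.

```scala
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.sql.SparkSession
import scala.io.Source
import scala.util.Try

object LoadTextVectors {
  // Parse one line of the assumed text format: "word v1 v2 ... vN".
  // Returns None for the header line or any line that does not have
  // exactly dim numeric values after the word.
  def parseLine(line: String, dim: Int): Option[(String, Array[Float])] = {
    val parts = line.trim.split("\\s+")
    if (parts.length != dim + 1) None
    else Try(parts.tail.map(_.toFloat)).toOption.map(parts.head -> _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("LoadTextVectors")
      .getOrCreate()

    // Read the whole file on the driver; fine for a small test chunk.
    val source = Source.fromFile("src/main/resources/vectors.txt", "UTF-8")
    val vectors =
      try source.getLines().flatMap(parseLine(_, 300)).toMap
      finally source.close()

    // Build the model directly from the word -> vector map.
    val model = new Word2VecModel(vectors)

    // Hypothetical output path; save fails if the directory already exists.
    val modelDir = "target/word2vec-model"
    model.save(spark.sparkContext, modelDir)

    // modelDir now contains metadata/ and data/, so load works on it.
    val reloaded = Word2VecModel.load(spark.sparkContext, modelDir)
    reloaded.findSynonyms("the", 5).foreach { case (word, similarity) =>
      println(s"$word $similarity")
    }

    spark.stop()
  }
}
```

This keeps every vector in driver memory, so it is only meant for a test chunk like yours. The full GoogleNews set (about 3 million words with 300 floats each) would need several gigabytes of heap.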