apache-spark, nlp, pyspark, apache-spark-mllib, word2vec

How to obtain the word list from pyspark word2vec model?


I am trying to generate word vectors using PySpark. With gensim I can see each word and its closest words as below:

import os
from gensim.models import word2vec

# Read the tweets and tokenise each line into a list of words
with open(os.getcwd() + "/tweets.txt") as f:
    sentences = f.read().splitlines()
w2v_input = []
for i in sentences:
    tokenised = i.split()
    w2v_input.append(tokenised)

# Train the gensim model and print each word with its nearest neighbours
model = word2vec.Word2Vec(w2v_input)
for key in model.wv.vocab.keys():
    print(key)
    print(model.most_similar(positive=[key]))

Using PySpark:

from pyspark.mllib.feature import Word2Vec

# Tokenise the tweets and fit a Spark MLlib word2vec model
inp = sc.textFile("tweet.txt").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)

How can I obtain the words from the vector space of this model, i.e. the PySpark equivalent of gensim's model.wv.vocab.keys()?

Background: I need to store the words and their synonyms from the model in a map so I can use them later for finding the sentiment of a tweet. I cannot reuse the word-vector model inside map functions in PySpark, as the model belongs to the SparkContext (error pasted below). I want the PySpark word2vec version rather than gensim because it provides better synonyms for certain test words.

 Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.

Any alternative solution is also welcome.
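
For illustration, this is roughly the pattern that fails (a simplified sketch, not my exact code): the model is referenced inside an RDD transformation, so Spark tries to ship it, together with the SparkContext it holds, to the workers.

# Simplified sketch of the failing pattern: referencing the model inside
# a transformation puts its SparkContext into the closure that is shipped
# to the workers, which raises the exception above.
tweets = sc.textFile("tweet.txt").map(lambda row: row.split(" "))
synonyms_per_tweet = tweets.map(
    lambda words: [model.findSynonyms(w, 5) for w in words])
synonyms_per_tweet.collect()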


Solution

  • The equivalent command in Spark is model.getVectors(), which again returns a dictionary. Here is a quick toy example with only 3 words (alpha, beta, charlie), adapted from the documentation:

    sc.version
    # u'2.1.1'
    
    from pyspark.mllib.feature import Word2Vec
    sentence = "alpha beta " * 100 + "alpha charlie " * 10
    localDoc = [sentence, sentence]
    doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    word2vec = Word2Vec()
    model = word2vec.fit(doc)
    
    model.getVectors().keys()
    #  [u'alpha', u'beta', u'charlie']
    

    Regarding finding synonyms, you may find another answer of mine useful.

    Regarding the error you mention and a possible workaround, have a look at this answer of mine.
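
    Putting the two together, here is a rough sketch (my own illustration, untested on your data) of how getVectors() and findSynonyms() could be combined on the driver to build a plain Python dict of word -> synonyms, which can then be broadcast and used safely inside map functions; the choice of 5 synonyms per word is arbitrary, and looping over the whole vocabulary can be slow for large models:

    # Build a plain dict of word -> list of synonyms on the driver;
    # findSynonyms returns (word, cosine similarity) pairs.
    synonyms = {w: [s for s, _ in model.findSynonyms(w, 5)]
                for w in model.getVectors().keys()}

    # Broadcast the plain dict; unlike the model itself, it holds no
    # reference to the SparkContext, so it can be used on the workers.
    syn_bc = sc.broadcast(synonyms)
    tweets = sc.textFile("tweet.txt").map(lambda line: line.split(" "))
    tweet_syns = tweets.map(
        lambda words: [syn_bc.value.get(w, []) for w in words])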