apache-spark, nlp, pyspark, apache-spark-mllib, word2vec

How to obtain the word list from pyspark word2vec model?


I am trying to generate word vectors using PySpark. With gensim I can see each word and its closest words as below:

import os
from gensim.models import word2vec

# Read the tweets and tokenise each line into a list of words
with open(os.getcwd() + "/tweets.txt") as f:
    sentences = f.read().splitlines()
w2v_input = []
for i in sentences:
    tokenised = i.split()
    w2v_input.append(tokenised)

# Train the gensim model and print each word with its nearest neighbours
model = word2vec.Word2Vec(w2v_input)
for key in model.wv.vocab.keys():
    print(key)
    print(model.most_similar(positive=[key]))

Using PySpark:

from pyspark.mllib.feature import Word2Vec

# Tokenise the tweets and fit a Spark MLlib word2vec model
inp = sc.textFile("tweet.txt").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)

How can I obtain the words from the vector space of this model, i.e. the PySpark equivalent of gensim's model.wv.vocab.keys()?

Background: I need to store the words and their synonyms from the model in a map so I can use them later for finding the sentiment of a tweet. I cannot reuse the word-vector model inside map functions in PySpark, as the model belongs to the SparkContext (error pasted below). I want the PySpark word2vec version rather than gensim because it provides better synonyms for certain test words.

 Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.

Any alternative solution is also welcome.
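
For illustration, this is roughly the pattern that fails (a simplified sketch, not my exact code): the model is referenced inside an RDD transformation, so Spark tries to ship it, together with the SparkContext it holds, to the workers.

# Simplified sketch of the failing pattern: referencing the model inside
# a transformation puts its SparkContext into the closure that is shipped
# to the workers, which raises the exception above.
tweets = sc.textFile("tweet.txt").map(lambda row: row.split(" "))
synonyms_per_tweet = tweets.map(
    lambda words: [model.findSynonyms(w, 5) for w in words])
synonyms_per_tweet.collect()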


Solution

  • The equivalent command in Spark is model.getVectors(), which again returns a dictionary. Here is a quick toy example with only 3 words (alpha, beta, charlie), adapted from the documentation:

    sc.version
    # u'2.1.1'
    
    from pyspark.mllib.feature import Word2Vec
    sentence = "alpha beta " * 100 + "alpha charlie " * 10
    localDoc = [sentence, sentence]
    doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    word2vec = Word2Vec()
    model = word2vec.fit(doc)
    
    model.getVectors().keys()
    #  [u'alpha', u'beta', u'charlie']
    

    Regarding finding synonyms, you may find another answer of mine useful.

    Regarding the error you mention and a possible workaround, have a look at this answer of mine.
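
    Putting the two together, here is a rough sketch (my own illustration, untested on your data) of how getVectors() and findSynonyms() could be combined on the driver to build a plain Python dict of word -> synonyms, which can then be broadcast and used safely inside map functions; the choice of 5 synonyms per word is arbitrary, and looping over the whole vocabulary can be slow for large models:

    # Build a plain dict of word -> list of synonyms on the driver;
    # findSynonyms returns (word, cosine similarity) pairs.
    synonyms = {w: [s for s, _ in model.findSynonyms(w, 5)]
                for w in model.getVectors().keys()}

    # Broadcast the plain dict; unlike the model itself, it holds no
    # reference to the SparkContext, so it can be used on the workers.
    syn_bc = sc.broadcast(synonyms)
    tweets = sc.textFile("tweet.txt").map(lambda line: line.split(" "))
    tweet_syns = tweets.map(
        lambda words: [syn_bc.value.get(w, []) for w in words])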