python nlp gensim phrases doc2vec

How to use doc2vec with phrases?


I want to use phrases in doc2vec, and I am building them with gensim.phrases. Doc2vec needs tagged documents to train the model, but I cannot work out how to tag the phrases. How can I do this?

Here is my code:

text = phrases.Phrases(text)
for i in range(len(text)):
    string1 = "SENT_" + str(i)

    sentence = doc2vec.LabeledSentence(tags=string1, words=text[i])
    text[i]=sentence

print "Training model..."
model = Doc2Vec(text, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

Solution

  • Calling Phrases() trains a phrase-detection model. You then apply that trained model to text to get back phrase-combined text.

    Don't replace your original text with the trained model, as the first line of your code does. Also, don't try to assign into the Phrases model, as your current loop does, or index the Phrases model with integers.

    The gensim docs for the Phrases class have examples of its proper use; if you follow that pattern you'll do well.

    Further, note that LabeledSentence has been replaced by TaggedDocument, and its tags argument should be a list of tags. If you provide a string, it will be treated as a list of one-character tags (instead of the single tag you intend).