
Doc2Vec gensim with supervised data predefined labels


I am trying to use gensim's doc2vec to train a model on a set of documents and a set of labels. The labels were created manually and need to be fed into the program for training. So far I have two lists: a list of sentences, and a list of labels, one per sentence. I need to use doc2vec specifically. Here is what I have tried so far.

from gensim import utils
from gensim.models import Doc2Vec

tweets = ["A tweet", "Another tweet", "A third tweet", ... , "A thousandth-something tweet"]
labels_list = [1, 1, 3, ... , 16]

tagged_data = [tweets, labels_list]
model = Doc2Vec(size=20, alpha=0.025, min_alpha=0.00025, min_count=1, dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha

The line model.build_vocab(tagged_data) fails with AttributeError: 'list' object has no attribute 'words'. I googled this and it says to put the data into a labeled sentence object, but I am not sure whether that will work with predefined labels. Does anyone know how to put predefined labels into doc2vec? Thanks in advance.


Solution

  • The corpus for Doc2Vec should be an iterable of objects shaped like the TaggedDocument example class included with gensim: each one has a words attribute (a list of string tokens) and a tags attribute (a list of tags). (Tags are the keys to the doc-vectors that are learned by training from each text. They are most often unique document IDs, but can also be known labels that repeat over multiple documents, or both IDs and labels.)

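    Your tagged_data, which is one list of untokenized strings plus one list of labels, does not match the expected format at all. When build_vocab() iterates over it, the first item it gets is the whole list of tweets rather than a document object, so the lookup of .words fails with the AttributeError you see.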

    You should look at, and work through, some of the example Jupyter notebooks about Doc2Vec in the gensim docs/notebooks directory, such as doc2vec-lee.ipynb or doc2vec-IMDB.ipynb. These can also be viewed online, for example:

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

    Also, you probably don't need or want to call train() multiple times in a loop while adjusting alpha yourself; that is easy to get wrong. (If you've copied that approach from an online example, that example is likely out-of-date.) Call train() once, with your preferred number of training passes in the epochs parameter.