I am trying to use gensim's doc2vec to create a model which will be trained on a set of documents and a set of labels. The labels were created manually and need to be fed into the program for training. So far I have 2 lists: a list of sentences, and a list of labels, one for each sentence. I need to use doc2vec specifically. Here is what I have tried so far.
from gensim import utils
from gensim.models import Doc2Vec
tweets = ["A tweet", "Another tweet", "A third tweet", ... , "A thousandth-something tweet"]
labels_list = [1, 1, 3, ... , 16]
tagged_data = [tweets, labels_list]
model = Doc2Vec(size=20, alpha=0.025, min_alpha=0.00025, min_count=1, dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    model.train(tagged_data, total_examples=model.corpus_count,
                epochs=model.iter)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
I am getting an error on the line with model.build_vocab(tagged_data): AttributeError: 'list' object has no attribute 'words'. I googled this and it says to put the data into a labeled sentence object, but I am not sure if that will work if I have predefined labels. So does anyone know how to put pre-defined labels into doc2vec? Thanks in advance.
The corpus for Doc2Vec should be an iterable of objects that are similar to the TaggedDocument example class included with gensim: each with a words attribute (a list of string tokens) and a tags attribute (a list of tags). (Tags are the keys to the doc-vectors that are learned by training from each text, and are most often unique document IDs, but can also be known labels that repeat over multiple documents, or both IDs and labels.)
Your tagged_data, with one list of non-tokenized strings and one list of labels, is not at all like the expected format.
You should look at, and work through, some of the example Jupyter notebooks about Doc2Vec in the gensim docs/notebooks directory, such as doc2vec-lee.ipynb or doc2vec-IMDB.ipynb. These can also be viewed online, for example:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Also, you probably don't need or want to call train() multiple times; it's easy to get wrong. (If you've copied that approach from an online example, that example is likely out-of-date.) Call it once, with your preferred number of training passes in the epochs parameter.