I would like to tag a list of documents with Gensim's TaggedDocument() and then pass these documents as the input to Doc2Vec().
I have read the documentation about TaggedDocument here, but I haven't understood exactly what the parameters words and tags are.
I have tried:
texts = [[word for word in document.lower().split()]
for document in X.values]
texts = [[token for token in text]
for text in texts]
model = gensim.models.Doc2Vec(texts, vector_size=200)
model.train(texts, total_examples=len(texts), epochs=10)
But I get the error: 'list' object has no attribute 'words'.
Doc2Vec expects an iterable collection of texts that are each shaped like the example TaggedDocument class, with both words and tags properties.
The words can be your tokenized text (as a list), but the tags should be a list of document tags that should receive learned vectors via the Doc2Vec algorithm. Most often, these are unique IDs, one per document. (You can just use plain int indexes, if that works as a way to refer to your documents elsewhere, or string IDs.) Note that tags must be a list of tags, even if you're only providing one per document.
You are simply providing a list of lists-of-words, thus generating the error.
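To make the required shape concrete, here is a minimal sketch that runs without gensim. It assumes a stand-in namedtuple called TaggedDoc (a hypothetical name) with the same two fields as TaggedDocument, and a made-up raw_docs list in place of X.values. It shows that a plain list of tokens has no words attribute, while the tagged form exposes both words and tags.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so this sketch runs without gensim.
# The name TaggedDoc is hypothetical; only the two field names matter here.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical sample data standing in for X.values.
raw_docs = ["Human machine interface", "Graph of trees"]

# What the question builds: a list of lists of words.
plain = [doc.lower().split() for doc in raw_docs]

# What Doc2Vec expects: objects with .words and .tags (tags is a list).
tagged = [
    TaggedDoc(words=doc.lower().split(), tags=[i])
    for i, doc in enumerate(raw_docs)
]

# Accessing .words on a plain list fails with the error from the question.
try:
    plain[0].words
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(tagged[0].words, tagged[0].tags)
```

With the real library, the same list comprehension would use TaggedDocument in place of the stand-in.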
Instead, try initializing texts with this single statement:
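from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per text, tagged with its integer position.
texts = [TaggedDocument(
    words=[word for word in document.lower().split()],
    tags=[i]
) for i, document in enumerate(X.values)]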
Also, you don't need to call train() if you've supplied texts when the Doc2Vec model was created. (By supplying the corpus at initialization, Doc2Vec will automatically do both an initial vocabulary-discovery scan and then your specified number of training passes.)
You should look at working examples for inspiration, such as the doc2vec-lee.ipynb runnable Jupyter notebook that's included with gensim. It will be in your install directory, if you can find it, but you can also view a (static, non-runnable) version inside the gensim source code repository at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb