How to properly tag a list of documents with Gensim TaggedDocument()


I would like to tag a list of documents with Gensim TaggedDocument(), and then pass these documents as input to Doc2Vec().

I have read the documentation for TaggedDocument, but I don't understand what exactly the words and tags parameters are.

I have tried:

texts = [[word for word in document.lower().split()]
          for document in X.values]

texts = [[token for token in text]
          for text in texts]

model = gensim.models.Doc2Vec(texts, vector_size=200)
model.train(texts, total_examples=len(texts), epochs=10)

But I get the error 'list' object has no attribute 'words'.


Solution

  • Doc2Vec expects an iterable collection of texts that are each (shaped like) the example TaggedDocument class, with both words and tags properties.

    The words can be your tokenized text (as a list), but the tags should be a list of document-tags that should receive learned vectors via the Doc2Vec algorithm. Most often, these are unique IDs, one per document. (You can just use plain int indexes, if that works as a way to refer to your documents elsewhere, or string IDs.) Note that tags must be a list-of-tags, even if you're only providing one per document.

    You are simply providing a list of lists-of-words, thus generating the error.
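To make the shape mismatch concrete, here is a minimal sketch that needs only the standard library. It uses a stand-in namedtuple with the same two fields as gensim's TaggedDocument; the name TaggedDoc and the sample sentences are hypothetical and not part of gensim.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

raw_docs = ["The cat sat", "Dogs bark loudly"]  # made-up sample data

# What the question passes in: bare token lists, with no .words or .tags.
plain = [doc.lower().split() for doc in raw_docs]

# What Doc2Vec needs: tokens plus a list of tags (here one int ID per document).
tagged = [TaggedDoc(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(raw_docs)]

print(hasattr(plain[0], "words"))       # a plain list has no such attribute
print(tagged[1].words, tagged[1].tags)  # both fields are available here
```

Anything that exposes words and tags in this way has the shape described above, which is why the tag is wrapped in a list even when there is only one.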

    Try instead just the single line to initialize texts:

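    from gensim.models.doc2vec import TaggedDocument

    # One TaggedDocument per text: tokens as words, the row index as its only tag
    texts = [TaggedDocument(
                 words=document.lower().split(),
                 tags=[i]
             ) for i, document in enumerate(X.values)]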
    

    Also, you don't need to call train() if you've supplied texts when the Doc2Vec was created. (By supplying the corpus at initialization, Doc2Vec will automatically do both an initial vocabulary-discovery scan and then your specified number of training passes.)

    You should look at working examples for inspiration, such as the doc2vec-lee.ipynb runnable Jupyter notebook that's included with gensim. It will be in your install directory, if you can find it, but you can also view a (static, non-runnable) version inside the gensim source code repository at:

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb