
gensim Doc2Vec: Getting from txt files to TaggedDocuments


Beginner here.

I have a large body of .txt files that I want to train a Doc2Vec model on. However, I am having trouble importing the data into python in a usable way.

To import data, I have used:

from os import listdir

docLabels = [f for f in listdir("PATH TO YOUR DOCUMENT FOLDER")
             if f.endswith(".txt")]
data = []
for doc in docLabels:
    data.append(open("PATH TO YOUR DOCUMENT FOLDER" + doc).read())

However, this gives me a plain list of strings, which I can't do any further work with. I cannot find anywhere on SO or in tutorials how to import text files in a way that they can be used with NLTK / doc2vec.

Help would be greatly appreciated. Thank you!


Solution

  • I'm only addressing the portion of the question indicated by the title, about Doc2Vec and TaggedDocument. (NLTK is a separate matter.)

    The TaggedDocument class requires you to specify words and tags for each object created.

    So where you are currently just appending a big full-read of the file to your data, you will instead want to:

    • break that data into words – one super-simple way is to just .split() it on whitespace, though most projects do more
    • decide on a tag or tags, perhaps just the filename itself
    • instantiate a TaggedDocument, and append that to data

    So, you could replace your existing loop with:

    from gensim.models.doc2vec import TaggedDocument

    for doc in docLabels:
        words = open("PATH TO YOUR DOCUMENT FOLDER" + doc).read().split()
        tags = [doc]
        data.append(TaggedDocument(words=words, tags=tags))