
gensim Doc2Vec: Getting from txt files to TaggedDocuments


Beginner here.

I have a large body of .txt files that I want to train a Doc2Vec model on. However, I am having trouble importing the data into python in a usable way.

To import data, I have used:

from os import listdir

docLabels = [f for f in listdir("PATH TO YOUR DOCUMENT FOLDER")
             if f.endswith(".txt")]
data = []
for doc in docLabels:
    data.append(open("PATH TO YOUR DOCUMENT FOLDER" + doc).read())

However, this gives me a plain list of strings, which I can't do any further work with. I cannot find anywhere on SO or in tutorials how to import text files in a way that they can be used with NLTK / doc2vec.

Help would be greatly appreciated. Thank you!


Solution

  • I'm only addressing the portion of the question indicated by the title, about Doc2Vec and TaggedDocument. (NLTK is a separate matter.)

    The TaggedDocument class requires you to specify words and tags for each object created.

    So where you are currently just appending a big full-read of the file to your data, you will instead want to:

    • break that data into words – one super-simple way is to just .split() it on whitespace, though most projects do more
    • decide on a tag or tags, perhaps just the filename itself
    • instantiate a TaggedDocument, and append that to data

    So, you could replace your existing loop with:

    from gensim.models.doc2vec import TaggedDocument

    for doc in docLabels:
        words = open("PATH TO YOUR DOCUMENT FOLDER" + doc).read().split()
        tags = [doc]
        data.append(TaggedDocument(words=words, tags=tags))