Beginner here.
I have a large body of .txt files that I want to train a Doc2Vec model on. However, I am having trouble importing the data into python in a usable way.
To import data, I have used:
from os import listdir

docLabels = []
docLabels = [f for f in listdir("PATH TO YOUR DOCUMENT FOLDER")
             if f.endswith('.txt')]

data = []
for doc in docLabels:
    data.append(open("PATH TO YOUR DOCUMENT FOLDER" + doc).read())
However, this just gives me a list of plain strings, which I can't do any further work with. I can't find anywhere on SO or in tutorials how to import text files in a way that they can be used with NLTK / Doc2Vec.
Help would be greatly appreciated. Thank you!
I'm only addressing the portion of the question indicated by the title, about Doc2Vec and TaggedDocument. (NLTK is a separate matter.)
The TaggedDocument class requires you to specify words and tags for each object created.
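For example, a single TaggedDocument might look like this (the words and the tag here are just placeholder values):

    from gensim.models.doc2vec import TaggedDocument

    # one document = a list of word tokens plus a list of tags
    example = TaggedDocument(words=['the', 'quick', 'brown', 'fox'], tags=['doc_0'])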
So where you are currently just appending a big full-read of the file to your data, you will instead want to:

- break the text into a list of words (simply .split() it on whitespace, though most projects do more preprocessing)
- create a TaggedDocument from those words and a chosen tag (such as the filename)
- and append that TaggedDocument to data
So, you could replace your existing loop with:
from gensim.models.doc2vec import TaggedDocument

for doc in docLabels:
    words = open("PATH TO YOUR DOCUMENT FOLDER" + doc).read().split()
    tags = [doc]
    data.append(TaggedDocument(words=words, tags=tags))
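Once data is a list of TaggedDocument objects, you can hand it straight to Doc2Vec. A minimal sketch of training follows; the parameter values are just illustrative, not recommendations:

    from gensim.models.doc2vec import Doc2Vec

    # build the vocabulary and train on the TaggedDocument list;
    # vector_size/min_count/epochs are illustrative, tune for your corpus
    model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
    model.build_vocab(data)
    model.train(data, total_examples=model.corpus_count, epochs=model.epochs)

    # look up the learned vector for one document by its tag (here, the filename);
    # this is model.dv in gensim 4.x (model.docvecs in older versions)
    vec = model.dv[docLabels[0]]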