I'm trying to use NLTK to get the most common words across an entire directory consisting of around a dozen sub-directories, each with around a dozen to two dozen text files. I'm using the os.walk function and NLTK's FreqDist, but my code doesn't seem to work. I've tried a lot of things but can't get it to run. Any help would be greatly appreciated!!
You don't need to include the sub-directories in the line:
with open(os.path.join(directory, subdirectory, file), "r") as f:
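The error occurs because subdirectory is a list: os.walk yields a (dirpath, dirnames, filenames) tuple for each directory it visits, and os.path.join only accepts path strings.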
You only need:
with open(os.path.join(directory, file), "r") as f:
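To illustrate why the shorter join is enough, here is a minimal sketch of what os.walk yields. The function name list_files and the loop variable names are hypothetical, since the original loop is not shown; the point is only that the first element of each tuple is already the full path of the directory being visited, while the second is a list of names.

```python
import os


def list_files(root):
    """Return the full path of every file below root (hypothetical helper)."""
    paths = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory.
    # dirpath is a string; dirnames and filenames are lists of names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # dirpath already includes any sub-directory components,
            # so joining it with the file name gives a usable path.
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

Passing the dirnames list to os.path.join raises a TypeError, which matches the kind of error described above.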
To report the total for all the files, create one FreqDist before the loop:
frequency = nltk.FreqDist()
Update it within the loop:
frequency.update(useful_words)
Then report the result after the loop:
print(frequency.most_common(20))
nltk.FreqDist is a subclass of collections.Counter, so update() and most_common() behave the same way as they do on a Counter.
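Putting the pieces together, a sketch of the whole approach might look like the following. Several things here are assumptions rather than the asker's code: the function name count_words, the .txt filter, the regex tokenizer (standing in for whatever produced useful_words), and the stopwords parameter. The fallback to collections.Counter is only there so the sketch runs without NLTK installed; it works because FreqDist subclasses Counter.

```python
import os
import re
from collections import Counter

try:
    from nltk import FreqDist  # subclass of collections.Counter
except ImportError:
    FreqDist = Counter  # assumption: fallback so the sketch runs without NLTK


def count_words(root, stopwords=frozenset()):
    """Count words in every .txt file below root (hypothetical helper)."""
    frequency = FreqDist()  # one distribution shared by all files
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                # Simplistic tokenizer, an assumption for illustration only.
                words = re.findall(r"[a-z']+", f.read().lower())
            useful_words = [w for w in words if w not in stopwords]
            frequency.update(useful_words)  # accumulate across files
    return frequency
```

Calling count_words on the top-level directory and then printing .most_common(20) on the result would report the twenty most frequent words across all sub-directories.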