python nltk attributeerror os.walk word-frequency

Trying to use os.walk and FreqDist for all files in a directory and its subdirectories

I'm trying to use NLTK to get the most common words used in an entire directory consisting of around a dozen sub-directories with around a dozen to two-dozen text files in each. I'm using the os.walk function and NLTK's FreqDist, but my code doesn't seem to work. I've tried lots but can't get it to run. Any help would be greatly appreciated!!

Solution

You don't need to include the sub-directories in the line:

with open(os.path.join(directory, subdirectory, file), "r") as f:

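The error occurs because subdirectory is a list. For each directory it visits, os.walk yields the directory path, a list of subdirectory names, and a list of file names, and os.path.join expects strings, not a list.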

You only need:

with open(os.path.join(directory, file), "r") as f:
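As a rough illustration of why the directory path alone is enough, here is a minimal sketch that builds a small throwaway tree and collects every file path with os.walk. The helper name list_files and the demo file names are made up for this example and are not from the question's code.

```python
import os
import tempfile


def list_files(root):
    """Return the full path of every file under root, including subdirectories."""
    paths = []
    # os.walk yields (dirpath, dirnames, filenames) for each directory it visits.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # dirpath already contains the subdirectory part,
            # so it is joined with the file name only.
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Throwaway tree for illustration: <root>/a.txt and <root>/sub/b.txt
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(demo_root, rel), "w") as f:
        f.write("hello")

demo_paths = list_files(demo_root)
print(demo_paths)
```

The file inside sub is found without ever touching the dirnames list, because os.walk visits sub on a later iteration and reports it as dirpath.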

To report the total across all the files, you can create a single FreqDist

frequency = nltk.FreqDist()

before the loop and then update it within the loop

frequency.update(useful_words)

and then report

print(frequency.most_common(20))

after the loop.
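Put together, the steps above might look something like the sketch below. It is not the asker's original code: the count_words helper, the demo files, and the whitespace tokenizer standing in for however useful_words was built are all assumptions for illustration. The import falls back to collections.Counter only so the sketch still runs where NLTK is not installed.

```python
import os
import tempfile

try:
    from nltk import FreqDist  # the class used in the answer
except ImportError:
    # Fallback so the sketch runs without NLTK; Counter offers the same
    # update() and most_common() methods used here.
    from collections import Counter as FreqDist


def count_words(root):
    """Count words across every file under root (hypothetical helper)."""
    frequency = FreqDist()  # created once, before the loop
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "r", encoding="utf-8") as f:
                # Stand-in tokenizer: lowercase and split on whitespace.
                useful_words = f.read().lower().split()
            # Add this file's words to the running totals.
            frequency.update(useful_words)
    return frequency


# Small demo tree with one file at the top level and one in a subdirectory.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "sub"))
samples = {
    "a.txt": "the cat sat",
    os.path.join("sub", "b.txt"): "the dog sat on the mat",
}
for rel, text in samples.items():
    with open(os.path.join(demo_root, rel), "w", encoding="utf-8") as f:
        f.write(text)

frequency = count_words(demo_root)
print(frequency.most_common(2))  # [('the', 3), ('sat', 2)]
```

In real use, the whitespace split would be replaced by whatever tokenizing and stop-word filtering produced useful_words in the original code.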

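This works because nltk.FreqDist is a subclass of collections.Counter, so update() adds to the existing counts instead of replacing them, and most_common() is inherited from Counter.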