python nltk attributeerror os.walk word-frequency

Trying to use os.walk and FreqDist for all files in a directory and its subdirectories


I'm trying to use NLTK to get the most common words across an entire directory made up of around a dozen sub-directories, each holding around a dozen to two dozen text files. I'm using the os.walk function and NLTK's FreqDist, but my code raises an error. I've tried a lot of variations and can't get it to run. Any help would be greatly appreciated!


Solution

  • You don't need to include the sub-directories in the line:

    with open(os.path.join(directory, subdirectory, file), "r") as f:
    

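    The error is raised because subdirectory is a list: os.walk yields a (dirpath, dirnames, filenames) tuple for each directory it visits, so the second item holds the names of that directory's sub-directories, while the first item is already the path to the directory containing the file.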

    You only need:

    with open(os.path.join(directory, file), "r") as f:
    

    To report the total across all the files, create a single empty distribution

    frequency = nltk.FreqDist()
    

    before the loop and then update it within the loop

    frequency.update(useful_words)
    

    and then report

    print(frequency.most_common(20))
    

    after the loop.

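    This works because nltk.FreqDist is a subclass of collections.Counter, so update() adds the new counts to the ones already stored and most_common() is available.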
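
Putting the pieces together, a minimal sketch of the whole loop might look like the code below. The question's full code is not shown, so several things here are assumptions: the function name count_words is made up for illustration, the whitespace split is only a placeholder for however useful_words is really built, the .txt filter assumes the corpus files use that extension, and collections.Counter stands in for nltk.FreqDist so the example runs without NLTK installed.

```python
import os
from collections import Counter


def count_words(root):
    """Return combined word counts for every .txt file under root (hypothetical helper)."""
    # Assumption: Counter stands in for nltk.FreqDist(), which subclasses it.
    frequency = Counter()
    # os.walk yields (dirpath, dirnames, filenames) for each directory it visits.
    for directory, subdirectories, files in os.walk(root):
        for file in files:
            if not file.endswith(".txt"):  # assumption: corpus files end in .txt
                continue
            # directory is already the path to the folder that holds this file,
            # so the list of sub-directory names is not needed here.
            with open(os.path.join(directory, file), "r", encoding="utf-8") as f:
                # Placeholder tokenisation; replace with the real filtering
                # that produces useful_words.
                useful_words = f.read().lower().split()
            # Add this file's counts to the running total.
            frequency.update(useful_words)
    return frequency
```

With NLTK installed, replacing Counter() with nltk.FreqDist() should behave the same way for this purpose, and count_words(some_directory).most_common(20) would then report the 20 most frequent words across all files.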