I want to get the word frequency for each file in a folder. However, my code does not work:
C:\Python\Anaconda3\python.exe C:/Python/Anaconda3/frequency.py
Traceback (most recent call last):
  File "C:/Python/Anaconda3/frequency.py", line 6, in <module>
    for word in file.read().split():
NameError: name 'file' is not defined
How can I do this correctly? Thank you.
import glob
import os
path = 'C:\Python\Anaconda3'
for filename in glob.glob(os.path.join(path, '*.txt')):
    wordcount = {}
    for word in file.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
print(word, wordcount)
As the code stands, you have three obvious errors (although there may be more).
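You have a for loop where the name of the loop variable changes: the loop binds filename, but the body reads from file, which is never defined. That is the cause of the NameError. Note also that filename is only a path string, so it has to be opened before it can be read.
for **filename** in glob.glob(os.path.join(path, '*.txt')):
    ...
    for word in **file**.read().split():
        ...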
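To make the first point concrete, here is a minimal sketch of reading each matched file. glob.glob returns path strings, so each path is opened before reading. The helper name read_words and the temporary sample file are assumptions made for illustration only; they are not part of the original code.

```python
import glob
import os
import tempfile


def read_words(filename):
    """Return the whitespace-separated words in the file at `filename`."""
    # glob yields path strings, so open the path to get a file object.
    with open(filename, encoding='utf-8') as f:
        return f.read().split()


# Hypothetical sample data so the sketch runs anywhere.
sample_dir = tempfile.mkdtemp()
with open(os.path.join(sample_dir, 'a.txt'), 'w', encoding='utf-8') as f:
    f.write('spam eggs spam')

words_by_file = {}
for filename in glob.glob(os.path.join(sample_dir, '*.txt')):
    words_by_file[os.path.basename(filename)] = read_words(filename)

print(words_by_file)
```

Using a with block also closes each file as soon as it has been read.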
The wordcount dictionary gets re-initialized (and thus erased) on each iteration of your outer for loop. You can fix this in two ways, depending on what you are trying to get:
a. Move the line wordcount = {} to before your for loops so the dictionary is not cleared for each file. This will give you a total wordcount across all files.
b. Store wordcount in another dictionary, filecounts, after each iteration of the outer loop. That way you have a dictionary where the keys are filenames and the values are dictionaries containing your word counts. This can be a bit confusing, because you now have a dictionary of dictionaries: an individual count is referenced as filecounts[filename][word].
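The difference between the two options can be sketched as follows. To keep the example self-contained, the texts dictionary is a hypothetical stand-in for the contents of real files; the names total and texts are assumptions for illustration.

```python
# Hypothetical file contents, keyed by file name, standing in for real files.
texts = {
    'a.txt': 'spam eggs spam',
    'b.txt': 'eggs ham',
}

# Option a: one dictionary, created once before the loops, totals all files.
total = {}
for text in texts.values():
    for word in text.split():
        total[word] = total.get(word, 0) + 1

# Option b: a fresh dictionary per file, stored under the file name.
filecounts = {}
for filename, text in texts.items():
    wordcount = {}
    for word in text.split():
        wordcount[word] = wordcount.get(word, 0) + 1
    filecounts[filename] = wordcount

print(total)
print(filecounts['a.txt']['spam'])
```

Option b is the one that matches a per-file frequency, since each file keeps its own counts.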
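Your print call does not do what you want: print(word, wordcount) prints a single word followed by the entire dictionary, rather than one count per word. Consider the following instead:
for word in wordcount:
    print('{word}:\t{count}'.format(word=word, count=wordcount[word]))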
I would also suggest using a defaultdict from the collections module (see the docs). This would eliminate the need to check whether a word is already in the dictionary and set it to 1.
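As a small illustration of that suggestion, the sketch below counts words in a hypothetical sample string. defaultdict(int) supplies 0 for any missing key, so the += 1 works on the first occurrence of a word as well.

```python
from collections import defaultdict

wordcount = defaultdict(int)  # int() returns 0, so missing words start at 0
for word in 'spam eggs spam'.split():
    wordcount[word] += 1  # no "if word not in wordcount" check needed

print(dict(wordcount))
```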
Putting it all together, I would write:
from collections import defaultdict
import glob
import os

path = r'C:\Python\Anaconda3'  # raw string, so the backslashes are kept literally
filecounts = {}
for filename in glob.glob(os.path.join(path, '*.txt')):
    wordcount = defaultdict(int)  # missing words start at 0
    # filename is only a path string; open it to get a readable file object
    with open(filename) as f:
        for word in f.read().split():
            wordcount[word] += 1
    filecounts[filename] = wordcount

for filename in filecounts:
    print('Word count for file \'{file}\''.format(file=filename))
    for word in filecounts[filename]:
        print('\t{word}:\t{count}'.format(word=word, count=filecounts[filename][word]))
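As an alternative sketch, not part of the approach above, collections.Counter from the standard library can do the counting in one call. The function name count_words_per_file and the temporary sample folder are assumptions made for illustration.

```python
from collections import Counter
import glob
import os
import tempfile


def count_words_per_file(folder):
    """Map each .txt path in `folder` to a Counter of its words."""
    counts = {}
    for filename in glob.glob(os.path.join(folder, '*.txt')):
        with open(filename, encoding='utf-8') as f:
            counts[filename] = Counter(f.read().split())
    return counts


# Hypothetical sample folder so the sketch runs anywhere.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'a.txt'), 'w', encoding='utf-8') as f:
    f.write('spam eggs spam')

result = count_words_per_file(folder)
for filename, counter in result.items():
    # most_common() lists (word, count) pairs from most to least frequent.
    print(os.path.basename(filename), counter.most_common())
```

Whether this is preferable depends on taste; it mainly saves the inner loop and gives sorted output through most_common().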