I have a small python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and infrequent words and their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine, however the third part of the assignment is to print the total number of unique words in the document. Unique words meaning count every word in the document, only once.
Without changing my current script too much, how can I count all the words in the document only one time?
p.s. I am using Python 2.6 so please don't mention the use of collections.Counter
from string import punctuation
from collections import defaultdict
import re
number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)
"""Define words as 2+ letters"""
def count_unique(s):
count = 0
if word in line:
if len(word) >= 2:
count += 1
return count
"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
for word in line.strip().split():
word = word.strip(punctuation).lower()
if words_only.match(word):
counter[word] += 1
# Most Frequent Words
top_words = sorted(counter.iteritems(),
key=lambda(word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
print "%s: %d" % (word, frequency)
# Least Frequent Words:
least_words = sorted(counter.iteritems(),
key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
print "%s: %d" % (word, frequency)
# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique
Count the number of key
s in your counter
dictionary:
total_unique = len(counter.keys())
Or more simply:
total_unique = len(counter)