Tags: python, python-2.6, defaultdict

Print 10 most infrequent words of a text document using python


I have a small Python script that prints the 10 most frequent words of a text document (each word being 2 letters or more), and I need to extend it to print the 10 most INfrequent words in the document as well. The script mostly works, except that the 10 most infrequent words it prints are numbers (integers and floats) when they should be words. How can I iterate over ONLY the words and exclude the numbers? Here is my full script:

# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
  if len(word) >= 2:
    counter[word] += 1

top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

EDIT: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.


Solution

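  • You're going to need a filter. Change the regex to match however you want to define a "word"; as written, it accepts only tokens made up entirely of two or more lowercase ASCII letters, so numbers (and words containing apostrophes or hyphens) are rejected: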

    import re
    alphaonly = re.compile(r"^[a-z]{2,}$")
    

    Now, do you want the word frequency table to not include numbers in the first place?

    counter = defaultdict(int)
    
    with open("charactermask.txt") as txt_file:
        for line in txt_file:
            for word in line.strip().split():
                word = word.strip(punctuation).lower()
                if alphaonly.match(word):
                    counter[word] += 1
    

    Or do you just want to skip over the numbers when extracting the least-frequent words from the table?

    import sys

    words_by_freq = sorted(counter.iteritems(),
                           key=lambda (word, count): (count, word))

    i = 0
    for word, frequency in words_by_freq:
        if alphaonly.match(word):
            i += 1
            # Python 2.6 needs explicit field indices; a bare "{}" only works on 2.7+.
            sys.stdout.write("{0}: {1}\n".format(word, frequency))
        if i == number:
            break
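
For reference, here is a minimal sketch that folds the filtering idea above into a few small functions. The function names (`count_words`, `least_frequent`, `most_frequent`) and the sample sentence are made up for illustration and are not part of the original script. It filters numbers out while counting (the first option), and it avoids `print` and `iteritems()` so the same code should run on both Python 2.6 and Python 3.

```python
import re
from collections import defaultdict
from string import punctuation

# Assumption: a "word" is two or more lowercase ASCII letters.
ALPHA_ONLY = re.compile(r"^[a-z]{2,}$")


def count_words(text):
    """Count alphabetic words in text, ignoring numbers and 1-letter tokens."""
    counter = defaultdict(int)
    for token in text.split():
        word = token.strip(punctuation).lower()
        if ALPHA_ONLY.match(word):
            counter[word] += 1
    return counter


def least_frequent(counter, n):
    """Return the n rarest (word, count) pairs; ties are broken alphabetically."""
    return sorted(counter.items(), key=lambda item: (item[1], item[0]))[:n]


def most_frequent(counter, n):
    """Return the n commonest (word, count) pairs; ties are broken alphabetically."""
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical sample text containing both words and numbers.
sample = "The cat saw the dog. The dog saw 3.14 cats, 42 times!"
counts = count_words(sample)
rare = least_frequent(counts, 3)
common = most_frequent(counts, 2)
```

With this sample, `rare` is `[('cat', 1), ('cats', 1), ('times', 1)]` and `common` is `[('the', 3), ('dog', 2)]`; the tokens `3.14` and `42` never enter the table. To use it on the file from the question, pass `txt_file.read()` to `count_words`.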