
How can you use Python to count the unique words in a text document (without case or special characters interfering)?


I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:

  1. Total word count
  2. Total count of unique words (without case and special characters interfering)
  3. The number of sentences
  4. Average words in a sentence
  5. Find commonly used phrases (a phrase of 3 or more words used over 3 times)
  6. A list of words used, in order of descending frequency (without case and special characters interfering)
  7. The ability to accept input from STDIN, or from a file specified on the command line

So far I have this Python program to print total word count:

with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    # Splitting on whitespace keeps punctuation attached to the words.
    words = p.split()
    wordCount = len(words)
    print("The total word count is:", wordCount)

So far I have this Python program to print the unique words and their frequencies. The output is not in order, and it treats variants such as dog, dog., "dog, and dog, as different words:

with open("/Users/name/Desktop/20words.txt", "r") as f:
    wordcount = {}
    # Count each whitespace-separated token exactly as it appears.
    for word in f.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

for k, v in wordcount.items():
    print(k, v)

Thank you for any help you can give!


Solution

  • The most difficult part is probably identifying the sentences. You could use a regular expression for this, but some ambiguity will remain, e.g. with names and titles, where a dot is followed by an upper-case letter. For words, too, you can use a simple regex instead of split. The exact expression depends on what qualifies as a "word". Finally, you can use collections.Counter to count all of these instead of doing it manually. Use str.lower to convert either the whole text or the individual words to lowercase.

    This should help you get started:

    import re, collections
    text = """Sentences start with an upper-case letter. Do they always end
    with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
    but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

    # A sentence: an upper-case start, then lazily up to ., ! or ? that is
    # followed by whitespace and an upper-case letter, or by the end of the text.
    sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
    sentences = collections.Counter(sentence.findall(text))
    for s, n in sentences.most_common():
        print(s, n)

    # Lowercase first so that "Dog" and "dog" count as the same word.
    word = re.compile(r"\w+")
    words = collections.Counter(word.findall(text.lower()))
    for w, n in words.most_common():
        print(w, n)
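Building on the same two regular expressions, the total word count, the unique word count, and the average sentence length could be derived roughly as follows. This is only a sketch: it assumes Python 3 (for true division), the helper name `text_stats` is made up for illustration, and it inherits the sentence regex's ambiguity around abbreviations such as "Mr.".

```python
import collections
import re

# Same patterns as in the snippet above.
SENTENCE = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
WORD = re.compile(r"\w+")


def text_stats(text):
    """Return basic counts for text (hypothetical helper, not a library function)."""
    words = WORD.findall(text.lower())
    counts = collections.Counter(words)
    sentences = SENTENCE.findall(text)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "sentences": len(sentences),
        # Guard against texts in which no sentence was recognised.
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # (word, count) pairs in order of descending frequency.
        "by_frequency": counts.most_common(),
    }


stats = text_stats("The dog barked. The dog, again, barked!")
print(stats["total_words"], stats["unique_words"], stats["sentences"])  # prints: 7 4 2
```

Because the words are lowercased and matched with `\w+`, "dog" and "dog," end up as the same key, which is what items 2 and 6 of the question ask for.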
    

    For "more power", you could use a natural language toolkit such as NLTK, but this might be a bit much for this task.
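The two items from the question that the snippet above does not touch, common phrases and reading from STDIN or a file, might be approached along these lines. This is a rough sketch under a few assumptions: Python 3, a phrase is taken to be a fixed run of `n` consecutive words (sentence boundaries are ignored), and the names `read_text` and `common_phrases` are invented for illustration.

```python
import collections
import re
import sys

WORD = re.compile(r"\w+")


def read_text(argv):
    """Read the file named in argv[1] if one is given, otherwise read STDIN."""
    if len(argv) > 1:
        with open(argv[1], "r") as f:
            return f.read()
    return sys.stdin.read()


def common_phrases(text, n=3, min_count=4):
    """Return (phrase, count) pairs for n-word runs seen at least min_count times.

    min_count=4 corresponds to "used over 3 times" in the question.
    """
    words = WORD.findall(text.lower())
    # Zipping n staggered copies of the word list yields every run of
    # n consecutive words.
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = collections.Counter(" ".join(gram) for gram in ngrams)
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


# Small demonstration on a literal string.
sample = "to be or not to be. " * 4
for phrase, count in common_phrases(sample):
    print(count, phrase)
```

In a command-line script you would replace the demonstration with something like `common_phrases(read_text(sys.argv))`. Phrases longer than three words can be found by calling the function again with a larger `n`.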