python collections frequency prettytable

Python: displaying frequent words in a table and skipping certain words


I'm doing a frequency analysis on a text file to show its 100 most commonly used words. Currently I'm using this code:

from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print(Counter(words).most_common(100))

The code above works and outputs:

[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]

However, I want to display it as a table with the headers "Word" and "Count". I've tried using the prettytable package and came up with this:

from collections import Counter
import re
import prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())

for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt

It gives me ValueError: too many values to unpack. What's wrong with my code, how can I fix it, and is there a way to display the data using prettytable?

Bonus question: Is there a way to leave out certain words while counting the frequency, e.g. skip the words "and", "if", "of", etc.?

Thanks.


Solution

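  • I am not sure how you expected the for loop you wrote to work. You get the error because you are iterating over the tuple ('Word', words), which has two elements, and unpacking each element into label and data. On the first iteration the element is the string 'Word', so the statement for label, data in ('Word', words) attempts to assign 'W' to label and 'o' to data, and ends up with 'r' and 'd' left over. Perhaps you meant to zip the items together instead? But then why are you making a new table for each word? Note also that prettytable is the module, while the table class is prettytable.PrettyTable, and that the left-alignment flag is the letter 'l', not the digit '1'.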
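
    The unpacking failure can be reproduced without the file or prettytable. The snippet below is only a small sketch: the sample list and the names error_message and pairs are made up for illustration. It shows the failing loop, and then the same loop over a list holding a single (label, data) pair, which unpacks cleanly.

```python
words = ['the', 'cat', 'the']  # stand-in for the real word list

# Iterating over the tuple yields the string 'Word' first, then the list.
# Unpacking the 4-character string 'Word' into two names raises ValueError.
try:
    for label, data in ('Word', words):
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Wrapping the pair in a list makes each iteration yield one (label, data) tuple.
pairs = []
for label, data in [('Word', words)]:
    pairs.append((label, len(data)))

print(pairs)
```

    Since there is only one table to build here, the loop is not needed at all.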

    Here is a rewritten version:

    from collections import Counter
    import re, prettytable
    
    words = re.findall(r'\w+', open('tweets.txt').read().lower())
    c = Counter(words)
    pt = prettytable.PrettyTable(['Words', 'Counts'])
    pt.align['Words'] = 'l'
    pt.align['Counts'] = 'r'
    for row in c.most_common(100):
        pt.add_row(row)
    print(pt)
    

    To skip elements in the most common count, you can simply discard them from the counter before calling most_common. One easy way to do that is to define a list of invalid words, and then to filter them out with a dict comprehension:

    bad_words = ['the', 'if', 'of']
    c = Counter({k: v for k, v in c.items() if k not in bad_words})
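
    To see the counter-side filtering in isolation, here is a minimal sketch on an inline string instead of tweets.txt. The sample text and the names sample_text, stop_words and filtered are assumptions for illustration. A set is used instead of a list because membership tests on a set are faster, though a list works the same way.

```python
from collections import Counter
import re

sample_text = "the cat and the hat and the bat of the cat"  # made-up sample
stop_words = {'the', 'and', 'of'}  # words to leave out of the ranking

words = re.findall(r'\w+', sample_text.lower())
counts = Counter(words)

# Rebuild the counter, keeping only words that are not in stop_words.
filtered = Counter({w: n for w, n in counts.items() if w not in stop_words})

print(counts.most_common(1))    # the stop word 'the' dominates before filtering
print(filtered.most_common(1))  # after filtering, 'cat' is the top word
```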
    

    Alternatively, you can do the filtering on the list of words before you make a counter out of it:

    words = filter(lambda x: x not in bad_words, words)
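
    Both approaches should produce the same counts. The sketch below checks that on a small made-up list, using a list comprehension in place of filter (an equivalent spelling that returns a list on both Python 2 and 3). The names kept, pre_filtered and post_filtered are illustrative only.

```python
from collections import Counter

words = ['the', 'cat', 'and', 'the', 'hat', 'of', 'cat']  # made-up sample
bad_words = ['the', 'if', 'of']

# Approach 1: drop unwanted words before counting.
kept = [w for w in words if w not in bad_words]
pre_filtered = Counter(kept)

# Approach 2: count everything, then drop unwanted keys from the counter.
post_filtered = Counter(
    {k: v for k, v in Counter(words).items() if k not in bad_words}
)

same = pre_filtered == post_filtered
print(same)
```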
    

    I prefer filtering the counter because it requires less work: the data has already been aggregated, so the check runs once per distinct word rather than once per occurrence. Here is the combined code for reference:

    from collections import Counter
    import re, prettytable
    
    bad_words = ['the', 'if', 'of']
    words = re.findall(r'\w+', open('tweets.txt').read().lower())
    
    c = Counter(words)
    c = Counter({k: v for k, v in c.items() if k not in bad_words})
    
    pt = prettytable.PrettyTable(['Words', 'Counts'])
    pt.align['Words'] = 'l'
    pt.align['Counts'] = 'r'
    for row in c.most_common(100):
        pt.add_row(row)
    
    print(pt)
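
    If it helps to reuse or test the logic, the counting step can be separated from the display step. The following is a rough sketch using only the standard library. top_words and format_table are hypothetical helper names, not part of prettytable, and the sample sentence is made up. The rows returned by top_words could equally be passed to pt.add_row as in the code above.

```python
from collections import Counter
import re

def top_words(text, stop_words=(), n=100):
    """Return up to n (word, count) pairs from text, skipping stop_words."""
    stop = set(stop_words)
    words = re.findall(r'\w+', text.lower())
    counts = Counter(w for w in words if w not in stop)
    return counts.most_common(n)

def format_table(rows, headers=('Word', 'Count')):
    """Render rows as plain text: words left-aligned, counts right-aligned."""
    word_width = max([len(headers[0])] + [len(w) for w, _ in rows])
    count_width = max([len(headers[1])] + [len(str(c)) for _, c in rows])
    template = '{:<{ww}}  {:>{cw}}'
    lines = [template.format(headers[0], headers[1],
                             ww=word_width, cw=count_width)]
    for word, count in rows:
        lines.append(template.format(word, count,
                                     ww=word_width, cw=count_width))
    return '\n'.join(lines)

rows = top_words("Great jobs, great country and great people",
                 stop_words=['and'], n=2)
table = format_table(rows)
print(table)
```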