I'm doing a frequency analysis on a text file that shows the 100 most commonly used words in the file. I'm currently using this code:
from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print Counter(words).most_common (100)
The code above works and the outputs are:
[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]
However, I want to display it in table form with the headers "Word" and "Count". I've tried using the prettytable package and came up with this:
from collections import Counter
import re
import prettytable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt
It gives me ValueError: too many values to unpack. My question is: what's wrong with my code, and is there a way to display the data using prettytable? Also, how can I fix my code?
Bonus question: is there a way to leave out certain words while counting the frequency, e.g. skip words such as "and", "if" and "of"?
Thanks.
I am not sure how you expected the for loop you wrote to work. The error occurs because you are iterating over the tuple ('Word', words), which has two elements. On the first iteration, the statement for label, data in ('Word', words) tries to unpack the string 'Word' into two names: it assigns 'W' to label and 'o' to data, and has 'r' and 'd' left over. Perhaps you meant to zip the items together instead? But then why are you making a new table for each word?
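To make the failure concrete, here is a minimal sketch that reproduces it without the tweets file. The short `words` list is a made-up stand-in for the output of `re.findall`, and the exact wording of the error message may differ slightly between Python versions.

```python
# Made-up stand-in for the list that re.findall produces in the question.
words = ['the', 'cat', 'the']

try:
    # The first item yielded by the tuple is the string 'Word' (4 characters),
    # which cannot be unpacked into just two names.
    for label, data in ('Word', words):
        pass
    error = None
except ValueError as exc:
    error = str(exc)

print(error)

# Unpacking the tuple itself (no loop) works, because it has exactly two items.
label, data = ('Word', words)
print(label)
```

Since only one table is needed, there is no real reason to loop here at all, which is why the rewritten version builds the table once.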
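Here is a rewritten version. It also calls prettytable.PrettyTable (the class) rather than the module itself, and uses 'l' rather than '1' for left alignment: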
from collections import Counter
import re, prettytable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)
To skip elements in the most common count, you can simply discard them from the counter before calling most_common. One easy way to do that is to define a list of unwanted words and then filter them out with a dict comprehension:
bad_words = ['the', 'if', 'of']
c = Counter({k: v for k, v in c.items() if k not in bad_words})
Alternatively, you can do the filtering on the list of words before you make a counter out of it:
words = filter(lambda x: x not in bad_words, words)
I prefer operating on the counter because that requires less work since the data has already been aggregated. Here is the combined code for reference:
from collections import Counter
import re, prettytable
bad_words = ['the', 'if', 'of']
words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
c = Counter({k: v for k, v in c.items() if k not in bad_words})
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)
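As a quick check that the two filtering approaches agree, here is a minimal sketch on a made-up sentence. The sample text and the names `sample`, `by_counter` and `by_list` are illustrative only and not part of the question; a set is used for `bad_words` on the assumption that membership tests matter more than ordering.

```python
from collections import Counter
import re

# Made-up sample text standing in for the contents of tweets.txt.
sample = "the cat and the dog and the cat sat"
bad_words = {'the', 'and'}  # a set makes "k not in bad_words" a fast lookup

words = re.findall(r'\w+', sample.lower())

# Approach 1: count everything, then drop unwanted keys from the counter.
by_counter = Counter({k: v for k, v in Counter(words).items()
                      if k not in bad_words})

# Approach 2: drop unwanted words first, then count what is left.
by_list = Counter(w for w in words if w not in bad_words)

print(by_counter == by_list)
print(by_counter.most_common(1))
```

Either counter can then be passed through most_common and add_row exactly as in the combined code above.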