Tags: python, nltk, word-frequency, trend

Getting cumulative counts of word frequencies found in documents


I have been trying to detect word/bigram trends across pieces of text. So far I have removed stop words, lowercased the text, computed word frequencies, and appended the 30 most common words per text to a list,

e.g.

[(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1),...]

Then I flattened the lists above into one big list containing all words and their per-document frequencies. What I need now is to get back a single list sorted by cumulative count, i.e.:

[(u'snow', 32), (u'said.', 12), (u'GoT', 10), (u'death', 8), (u'entertainment', 4)..]

Any ideas?

Code:

from nltk import FreqDist

fdists = []
for i in texts:
    # lowercase, drop stop words, and count word frequencies for this text
    words = FreqDist(w.lower() for w in i.split() if w.lower() not in stopwords)
    fdists.append(words.most_common(30))

# flatten the per-document lists into one list of (word, count) pairs
all_in_one = [item for sublist in fdists for item in sublist]

Solution

  • If all you want to do is merge the duplicate pairs and sort by total count, you can use a plain dict

    import operator
    
    fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
    fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
    fdists += fdists2
    
    # accumulate counts per word across all documents
    fdict = {}
    for word, count in fdists:
        if word in fdict:
            fdict[word] += count
        else:
            fdict[word] = count
    
    sorted_f = sorted(fdict.items(), key=operator.itemgetter(1), reverse=True)
    print(sorted_f[:30])
    
    [(u'said.', 6), (u'seeing', 5), (u'death', 4), (u'entertainment', 4), (u'read', 4), (u'it\u2019s', 4), (u'weiss', 4), (u'one', 4), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
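
    Note that this merge-and-sort is exactly what collections.Counter in the standard library does for you; a minimal sketch using the same fdists data as above:

    from collections import Counter
    
    # a Counter returns 0 for missing keys, so += sums duplicates directly
    totals = Counter()
    for word, count in fdists:
        totals[word] += count
    
    # most_common() returns (word, count) pairs sorted by count, descending
    print(totals.most_common(30))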
    

    Another way to handle the duplicates is pandas: group the rows by word with groupby(), sum the counts, and then sort by count and word with sort_values(), like so

    import pandas as pd
    
    fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
    fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
    fdists += fdists2
    
    # one row per (word, count) pair
    df = pd.DataFrame(data=fdists, columns=['word', 'count'])
    # sum the counts of duplicate words across documents
    df = df.groupby('word', as_index=False)['count'].sum()
    
    # sort by count (descending), breaking ties alphabetically by word
    Sorted = df.sort_values(['count', 'word'], ascending=[False, True])
    print(Sorted[:30])
    
                 word  count
    8           said.      6
    9          seeing      5
    2           death      4
    3   entertainment      4
    4            it’s      4
    5             one      4
    7            read      4
    12          weiss      4
    0          bloody      1
    1          dead,”      1
    6          people      1
    10           shot      1
    11         show’s      1
    13            “it      1
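
    For completeness, the Counter-based merge can also be folded straight into your original loop, skipping the intermediate flattened list; texts and stopwords here are your own data from the question:

    from collections import Counter
    from nltk import FreqDist
    
    totals = Counter()
    for i in texts:
        words = FreqDist(w.lower() for w in i.split() if w.lower() not in stopwords)
        # Counter.update() adds counts, so duplicates across documents accumulate
        totals.update(dict(words.most_common(30)))
    
    print(totals.most_common(30))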