python-3.x pandas nltk text-mining countvectorizer

How to find the most frequent words in a corpus in a Pandas dataframe (Python)


I have a Pandas dataframe that looks like the one below. I tokenized my text files and used CountVectorizer to convert them into a Pandas dataframe. I have also already removed stopwords and punctuation from my corpus. I am trying to find the most frequent words in my corpus. In the dataframe below, words such as "aaron" and "abandon" appeared more than 10 times, so those words should be in the new dataframe.

Note: I am new to Python and I am not sure how to implement this. Please provide an explanation with code.

Subset of the dataframe

I have already cleaned my corpus, and my dataframe looks like the following:

{'aaaahhhs': {990: 0, 991: 0, 992: 0, 993: 0, 994: 0, 995: 0, 996: 0, 997: 0, 998: 0, 999: 0, 1000: 1}, 'aahs': {990: 0, 991: 0, 992: 0, 993: 0, 994: 0, 995: 0, 996: 0, 997: 0, 998: 0, 999: 0, 1000: 1}, 'aamir': {990: 0, 991: 0, 992: 0, 993: 0, 994: 0, 995: 0, 996: 0, 997: 0, 998: 0, 999: 0, 1000: 1}, 'aardman': {990: 0, 991: 0, 992: 0, 993: 0, 994: 0, 995: 0, 996: 0, 997: 0, 998: 0, 999: 0, 1000: 2}, 'aaron': {990: 0, 991: 0, 992: 0, 993: 0, 994: 0, 995: 0, 996: 4, 997: 0, 998: 0, 999: 0, 1000: 14}, 'abandon': {990: 0, 991: 0, 992: 0, 993: 0, 994: 0, 995: 0, 996: 0, 997: 0, 998: 0, 999: 0, 1000: 16}}
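
To reproduce the frame, a dictionary of this shape can be passed to `pd.DataFrame`. The following is a minimal sketch, assuming the dictionary above is the output of `df.to_dict()`; for brevity it lists only the non-zero cells and fills the rest with 0, and the name `nonzero` is purely illustrative.

```python
import pandas as pd

# Non-zero cells of the subset shown above, as {word: {row_label: count}}.
# All other cells in rows 990..1000 are 0.
nonzero = {
    "aaaahhhs": {1000: 1},
    "aahs": {1000: 1},
    "aamir": {1000: 1},
    "aardman": {1000: 2},
    "aaron": {996: 4, 1000: 14},
    "abandon": {1000: 16},
}

# Outer keys become columns (words), inner keys become row labels.
# reindex restores the full row range; missing cells are filled with 0.
df = pd.DataFrame(nonzero).reindex(range(990, 1001)).fillna(0).astype(int)

print(df.tail(5))
```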



Solution

  • If you need the top N words:

    N = 2 
    print (df.sum().nlargest(N).index)
    Index(['aaron', 'abandon'], dtype='object')
    

    Another solution:

    print (df.sum().sort_values(ascending=False).index[:N])
    Index(['aaron', 'abandon'], dtype='object')
    

    If you also need the counts, as a one-column DataFrame or as a Series (remove to_frame):

    N = 2
    print (df.sum().nlargest(N).to_frame('count'))
             count
    aaron       18
    abandon     16
    print (df.sum().sort_values(ascending=False).iloc[:N].to_frame('count'))
             count
    aaron       18
    abandon     16
    

    If you need a two-column DataFrame:

    print (df.sum().nlargest(N).rename_axis('word').reset_index(name='count'))
          word  count
    0    aaron     18
    1  abandon     16
    
    print (df.sum()
             .sort_values(ascending=False).iloc[:N]
             .rename_axis('word')
             .reset_index(name='count'))
          word  count
    0    aaron     18
    1  abandon     16
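
The question also asks for every word whose total exceeds a cutoff (more than 10 occurrences) rather than a fixed top N. The following is a minimal sketch of that variant, assuming the sample rows from the question; the names `threshold`, `frequent` and `new_df` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question (rows 990..1000); only non-zero cells are listed.
nonzero = {
    "aaaahhhs": {1000: 1},
    "aahs": {1000: 1},
    "aamir": {1000: 1},
    "aardman": {1000: 2},
    "aaron": {996: 4, 1000: 14},
    "abandon": {1000: 16},
}
df = pd.DataFrame(nonzero).reindex(range(990, 1001)).fillna(0).astype(int)

threshold = 10  # assumed cutoff: keep words that occur more than 10 times

totals = df.sum()                      # total count per word (one value per column)
frequent = totals[totals > threshold]  # keep only the words above the cutoff
new_df = df[frequent.index]            # original frame restricted to those words

print(frequent.sort_values(ascending=False).to_frame('count'))
print(new_df.columns.tolist())
```

Here `df.sum()` adds up each column, so `totals` holds one count per word; the boolean mask `totals > threshold` then selects the words to keep. Unlike `nlargest(N)`, the number of words returned depends on the data rather than on a fixed N.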