
How to break down the top words per document in a row; Pandas Dataframe


I am trying to break down the text column of a dataframe and get the top words per row (document). I already have the top words for the dataframe as a whole: in this example they are "machine" and "learning", each with a count of 8. However, I'm unsure how to break these counts down per document instead of over the whole dataframe.

Below are the results for the top words for the dataframe as a whole:

machine      8
learning     8
important    2
think        1
significant  1

import pandas as pd
y = ['machine learning. i think machine learning rather significant machine learning',
     'most important aspect is machine learning. machine learning very important essential',
    'i believe machine learning great, machine learning machine learning']
x = ['a','b','c']
practice = pd.DataFrame(data=y,index=x,columns=['text'])

What I am expecting, next to the text column, is another column that gives the count of a top word in each row. For example, for the word 'machine' the dataframe should look like:

a / … / 3
b / … / 2
c / … / 3
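
To make the expected output concrete, here is a minimal sketch of one way to count a chosen word per row. It assumes whole-word, case-insensitive matching is what is wanted; the helper name `add_word_count` and the per-word column names are hypothetical and only for illustration.

```python
import re

import pandas as pd

texts = ['machine learning. i think machine learning rather significant machine learning',
         'most important aspect is machine learning. machine learning very important essential',
         'i believe machine learning great, machine learning machine learning']
practice = pd.DataFrame(data=texts, index=['a', 'b', 'c'], columns=['text'])


def add_word_count(df, word, text_column='text'):
    """Add a column named after `word` holding its count in each row (hypothetical helper)."""
    # \b anchors restrict the match to whole words, so 'machine' does not match 'machines'
    pattern = r'\b' + re.escape(word) + r'\b'
    df[word] = df[text_column].str.count(pattern, flags=re.IGNORECASE)
    return df


for word in ['machine', 'learning']:
    add_word_count(practice, word)

print(practice[['machine', 'learning']])
```

Because the pattern matches on word boundaries, "learning." with a trailing period is still counted as "learning".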


Solution

  • You can do this with Counter from the collections module. The loop below counts the words in each row and stores the count of the most frequent word in a new column.

    import pandas as pd
    from collections import Counter
    y = ['machine learning. i think machine learning rather significant machine learning',
         'most important aspect is machine learning. machine learning very important essential',
        'i believe machine learning great, machine learning machine learning']
    x = ['a','b','c']
    practice = pd.DataFrame(data=y,index=x,columns=['text'])
    
    
    word_frequency = []
    
    for line in practice["text"]:
        # Split the line into words, stripping surrounding punctuation so that
        # "learning." and "learning" are counted as the same word
        words = [word.strip(".,") for word in line.split()]
        words_counter = Counter(words)    # count the occurrences of each word
        # most_common(1) returns [(word, count)]; [0][1] is the count of the most frequent word
        top_count = words_counter.most_common(1)[0][1]
        word_frequency.append(top_count)     # append the count (not the word itself) to the list
    
    practice["Word Frequency"] = word_frequency     # add the list as a new column in the dataframe
    print(practice)
    

    Please refer to the Counter documentation for more details: https://docs.python.org/2/library/collections.html#collections.Counter
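
If the top word itself is also needed, not only its count, the same Counter approach can return both. The following is a rough sketch rather than a definitive implementation: the function name `top_word` and the column names `top_word` and `top_count` are made up for this example, and the tokenization (lowercasing and stripping a few punctuation characters) is an assumption about how words should be normalized.

```python
from collections import Counter

import pandas as pd

texts = ['machine learning. i think machine learning rather significant machine learning',
         'most important aspect is machine learning. machine learning very important essential',
         'i believe machine learning great, machine learning machine learning']
practice = pd.DataFrame(data=texts, index=['a', 'b', 'c'], columns=['text'])


def top_word(text):
    """Return (word, count) for the most frequent token in `text` (hypothetical helper)."""
    # Lowercase and strip common punctuation so "learning." and "learning" are one token
    tokens = [token.strip('.,!?;:').lower() for token in text.split()]
    tokens = [token for token in tokens if token]  # drop tokens that were only punctuation
    return Counter(tokens).most_common(1)[0]


results = [top_word(text) for text in practice['text']]
practice['top_word'] = [word for word, _ in results]
practice['top_count'] = [count for _, count in results]

print(practice[['top_word', 'top_count']])
```

When several words share the highest count, as "machine" and "learning" do here, most_common keeps the one encountered first in the row (on Python 3.7 and later), so the reported word depends on word order. No stop words are removed in this sketch.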