Search code examples
pythonscikit-learnword-frequencycountvectorizer

Return the list of each word in a pandas cell and the total count of that word in the entire column


I have a pandas data frame, df which looks like this:

             column1
0   apple is a fruit
1        fruit sucks
2  apple tasty fruit
3   fruits what else
4      yup apple map
5   fire in the hole
6       that is true

I want to produce a column2, which is the list of each word in the row and the total count of each word in the entire column. So the output would be something like this....

    column1            column2
0   apple is a fruit   [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1        fruit sucks   [('fruit', 3),('sucks', 1)]

I tried using the sklearn, but failing to achieve the above. Need help.

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['text'])

Solution

  • Here is one way that gives the result you want, although avoids sklearn entirely:

    def counts(data, column):
        full_list = []
        datr = data[column].tolist()
        total_words = " ".join(datr).split(' ')
        # per rows
        for i in range(len(datr)):
            #first per row get the words
            word_list = re.sub("[^\w]", " ",  datr[i]).split()
            #cycle per word
            total_row = []
            for word in word_list:
                count = []
                count = total_words.count(word)
                val = (word, count)
                total_row.append(val)
            full_list.append(total_row)
        return full_list
    
    df['column2'] = counts(df,'column1')
    df
             column1                                    column2
    0   apple is a fruit  [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
    1        fruit sucks                   [(fruit, 3), (sucks, 1)]
    2  apple tasty fruit       [(apple, 3), (tasty, 1), (fruit, 3)]
    3   fruits what else        [(fruits, 1), (what, 1), (else, 1)]
    4      yup apple map           [(yup, 1), (apple, 3), (map, 1)]
    5   fire in the hole  [(fire, 1), (in, 1), (the, 1), (hole, 1)]
    6       that is true            [(that, 1), (is, 2), (true, 1)]