python, pandas, count, nltk

Count the frequency of two-word combinations across all the rows of a column


I want to count the frequency of each two-word combination across all the rows of a column.

I have a table with two columns: the first contains a sentence, while the second contains the bigram tokenization of that sentence.

Sentence                           words
'beautiful day suffered through '  'beautiful day'
'beautiful day suffered through '  'day suffered'
'beautiful day suffered through '  'suffered through'
'cannot hold back tears '          'cannot hold'
'cannot hold back tears '          'hold back'
'cannot hold back tears '          'back tears'
'ash back tears beautiful day '    'ash back'
'ash back tears beautiful day '    'back tears'
'ash back tears beautiful day '    'tears beautiful'
'ash back tears beautiful day '    'beautiful day'
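
For reproducibility, a frame like this can be built from the sentences alone. This is only a minimal sketch of how the data might have been produced; the column names Sentence/words and the use of nltk.bigrams are assumptions, since the question does not show how the bigrams were generated:

import pandas as pd
from nltk import bigrams  # yields consecutive word pairs from a token list

sentences = [
    "beautiful day suffered through",
    "cannot hold back tears",
    "ash back tears beautiful day",
]

# one row per (sentence, bigram) pair, with each bigram joined back into a string
rows = [(s, " ".join(pair)) for s in sentences for pair in bigrams(s.split())]
df = pd.DataFrame(rows, columns=["Sentence", "words"])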

My desired output is a column counting how often each bigram appears across the whole df['Sentence'] column. Something like this:

Sentence                           Words               Total
'beautiful day suffered through '  'beautiful day'     2
'beautiful day suffered through '  'day suffered'      1
'beautiful day suffered through '  'suffered through'  1
'cannot hold back tears '          'cannot hold'       1
'cannot hold back tears '          'hold back'         1
'cannot hold back tears '          'back tears'        2
'ash back tears beautiful day '    'ash back'          1
'ash back tears beautiful day '    'back tears'        2
'ash back tears beautiful day '    'tears beautiful'   1
'ash back tears beautiful day '    'beautiful day'     2

and so on.

The code I have tried repeats the same frequency for every row of a given sentence:

df.Sentence.str.count('|'.join(df.words.tolist()))

So it is not what I am looking for, and it also takes a very long time, since my original df is much larger.

Is there an alternative, or a function in NLTK or any other library that does this?


Solution

  • I suggest:

    • Start by removing the quotes, and the whitespace at the beginning and end of both Sentence and words:
    data = data.apply(lambda x: x.str.replace("'", ""))
    data["Sentence"] = data["Sentence"].str.strip()
    data["words"] = data["words"].str.strip()
    
    • Then make sure Sentence and words are string objects:
    data = data.astype({"Sentence":str, "words": str})
    print(data)
    
    #Output
                              Sentence            words
    0   beautiful day suffered through     beautiful day
    1   beautiful day suffered through      day suffered
    2   beautiful day suffered through  suffered through
    3           cannot hold back tears       cannot hold
    4           cannot hold back tears         hold back
    5           cannot hold back tears        back tears
    6     ash back tears beautiful day          ash back
    7     ash back tears beautiful day        back tears
    8     ash back tears beautiful day   tears beautiful
    9     ash back tears beautiful day     beautiful day
    
    • Count the occurrences of the given words within the Sentence on the same row and store them in a column, e.g. words_occur:
    def words_in_sent(row):
        return row["Sentence"].count(row["words"])
    data["words_occur"] = data.apply(words_in_sent, axis=1)
    
    • Finally, group by words and sum up their occurrences (a shorter alternative is sketched after the result below):
    data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
    print(data)
    

    Result

                              Sentence          words    words_occur total
    0   beautiful day suffered through     beautiful day           1     2
    1   beautiful day suffered through      day suffered           1     1
    2   beautiful day suffered through  suffered through           1     1
    3           cannot hold back tears       cannot hold           1     1
    4           cannot hold back tears         hold back           1     1
    5           cannot hold back tears        back tears           1     2
    6     ash back tears beautiful day          ash back           1     1
    7     ash back tears beautiful day        back tears           1     2
    8     ash back tears beautiful day   tears beautiful           1     1
    9     ash back tears beautiful day     beautiful day           1     2
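
    • As a follow-up on the NLTK part of the question: in this data every bigram occurs at most once per sentence, so words_occur is always 1 and the total is simply the number of rows sharing a bigram. Under that assumption (and it is an assumption about your data), the apply step can be skipped entirely, which is usually faster on a large frame; nltk.FreqDist (or collections.Counter) gives the same totals. A rough sketch:

    # assumes each bigram appears at most once per sentence,
    # so the total is just the number of rows per bigram
    data["total"] = data.groupby("words")["words"].transform("count")

    # equivalent with NLTK: a frequency distribution over the bigram column
    from nltk import FreqDist
    freq = FreqDist(data["words"])           # bigram string -> count
    data["total"] = data["words"].map(freq)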