I want to count the frequency of each two-word combination (bigram) across all the rows of a column.
I have a table with two columns: the first holds a sentence, and the second holds a bigram tokenization of that sentence.
Sentence | words |
---|---|
'beautiful day suffered through ' | 'beautiful day' |
'beautiful day suffered through ' | 'day suffered' |
'beautiful day suffered through ' | 'suffered through' |
'cannot hold back tears ' | 'cannot hold' |
'cannot hold back tears ' | 'hold back' |
'cannot hold back tears ' | 'back tears' |
'ash back tears beautiful day ' | 'ash back' |
'ash back tears beautiful day ' | 'back tears' |
'ash back tears beautiful day ' | 'tears beautiful' |
'ash back tears beautiful day ' | 'beautiful day' |
My desired output is a column counting the frequency of each words value across the whole df['Sentence'] column. Something like this:
Sentence | Words | Total |
---|---|---|
'beautiful day suffered through ' | 'beautiful day' | 2 |
'beautiful day suffered through ' | 'day suffered' | 1 |
'beautiful day suffered through ' | 'suffered through' | 1 |
'cannot hold back tears ' | 'cannot hold' | 1 |
'cannot hold back tears ' | 'hold back' | 1 |
'cannot hold back tears ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'ash back' | 1 |
'ash back tears beautiful day ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'tears beautiful' | 1 |
'ash back tears beautiful day ' | 'beautiful day' | 2 |
and so on.
The code I have tried assigns the same count to every row of a given sentence:
df.Sentence.str.count('|'.join(df.words.tolist()))
So it is not what I am looking for, and it also takes a very long time, as my original df is much larger.
Is there any alternative, or any function in NLTK or another library?
I suggest the following. First, remove the quotation marks and strip the trailing whitespace from Sentence and words:
data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
Then make sure Sentence and words are string objects:
data = data.astype({"Sentence": str, "words": str})
print(data)
#Output
Sentence words
0 beautiful day suffered through beautiful day
1 beautiful day suffered through day suffered
2 beautiful day suffered through suffered through
3 cannot hold back tears cannot hold
4 cannot hold back tears hold back
5 cannot hold back tears back tears
6 ash back tears beautiful day ash back
7 ash back tears beautiful day back tears
8 ash back tears beautiful day tears beautiful
9 ash back tears beautiful day beautiful day
Next, count how often each row's words appears in its own Sentence, in a new column words_occur:
def words_in_sent(row):
    return row["Sentence"].count(row["words"])

data["words_occur"] = data.apply(words_in_sent, axis=1)
Finally, group by words and sum up the occurrences:
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)
Result
Sentence words words_occur total
0 beautiful day suffered through beautiful day 1 2
1 beautiful day suffered through day suffered 1 1
2 beautiful day suffered through suffered through 1 1
3 cannot hold back tears cannot hold 1 1
4 cannot hold back tears hold back 1 1
5 cannot hold back tears back tears 1 2
6 ash back tears beautiful day ash back 1 1
7 ash back tears beautiful day back tears 1 2
8 ash back tears beautiful day tears beautiful 1 1
9 ash back tears beautiful day beautiful day 1 2
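If every occurrence of a bigram in a sentence produces exactly one row of the table (as in the example, where every words_occur is 1), a faster alternative sketch is to skip the row-wise apply entirely and just count rows per bigram with value_counts. The DataFrame below is rebuilt from the example data for illustration:

```python
import pandas as pd

# Rebuild the example table (quotes and trailing spaces already removed).
data = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
                + ["cannot hold back tears"] * 3
                + ["ash back tears beautiful day"] * 4,
    "words": ["beautiful day", "day suffered", "suffered through",
              "cannot hold", "hold back", "back tears",
              "ash back", "back tears", "tears beautiful", "beautiful day"],
})

# One row per bigram occurrence, so the total for each bigram
# is simply how many rows carry that bigram.
data["total"] = data["words"].map(data["words"].value_counts())
print(data[["words", "total"]])
```

This is vectorized, so it should scale much better on a large df than apply(..., axis=1); the trade-off is that it misses the case where a bigram repeats inside a single sentence without a matching extra row.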