Tags: python, pandas, counter, similarity

Find repeated sentences within text


I would like to know how to detect when a sentence is repeated within a single string. I have a list of strings like these:

my_list=["do you want pizza for dinner? Do you want pizza for dinner?", "I like pizza", "I have no money I have no money"]

I would like to create a pandas DataFrame that assigns 1 if a sentence is repeated within the same string, and 0 otherwise.

Something like this:

Text                                                              Repeated?
do you want pizza for dinner? Do you want pizza for dinner?            1
I like pizza                                                           0
I have no money I have no money                                        1

I was thinking of something like this:

from collections import Counter

# my_list is a list, so split each sentence separately and count its words.
for sentence in my_list:
    counts = Counter(sentence.split())
    for word in sorted(counts):
        print(f'"{word}" is repeated {counts[word]} time(s).')

Then I would compare the total number of words with the number of unique words in each sentence. But I don't think this is a good approach. Is there another way to get the expected result?
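
For reference, the word-counting idea described above could be sketched roughly as follows. This is only an illustration under an assumed rule (a string counts as "repeated" when every word, compared case-insensitively, occurs an even number of times); the helper name `looks_repeated_by_counts` is hypothetical.

```python
from collections import Counter


def looks_repeated_by_counts(sentence: str) -> bool:
    """Heuristic: True if every word occurs an even number of times (case-insensitive)."""
    counts = Counter(sentence.lower().split())
    # An empty string has no words, so it is not treated as repeated.
    return bool(counts) and all(n % 2 == 0 for n in counts.values())


my_list = [
    "do you want pizza for dinner? Do you want pizza for dinner?",
    "I like pizza",
    "I have no money I have no money",
]

flags = [int(looks_repeated_by_counts(s)) for s in my_list]
print(flags)  # [1, 0, 1]
```

Because the counts ignore word order, a string such as "a b b a" would also be flagged, which is one reason this heuristic is weaker than comparing the two halves of the string directly.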


Solution

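  • You can use a regular expression for this task. The pattern captures a leading chunk of text and, through the backreference \1, requires the same chunk to follow it up to the end of the string; re.I makes the comparison case-insensitive: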

    import re
    import pandas as pd
    
    my_list=["do you want pizza for dinner? Do you want pizza for dinner?", "I like pizza", "I have no money I have no money"]
    df = pd.DataFrame({'Text': my_list})
    
    r = re.compile(r'(.+)\s*\1$', flags=re.I)
    df['Repeated'] = df['Text'].apply(lambda x: bool(r.match(x))).astype(int) 
    print(df)
    

    Prints:

                                                    Text  Repeated
    0  do you want pizza for dinner? Do you want pizz...         1
    1                                       I like pizza         0
    2                    I have no money I have no money         1
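
One caveat: the pattern above works at the character level, so a single word such as "haha" is also reported as repeated. If that matters, a slightly stricter variant might look like the sketch below. The names `REPEAT` and `is_repeated` are hypothetical, and requiring at least one whitespace character between the two copies (`\s+` instead of `\s*`) is an assumption about what should count as a repeated sentence.

```python
import re

# Original pattern from the answer, kept for comparison.
ORIGINAL = re.compile(r'(.+)\s*\1$', flags=re.I)

# Stricter variant: the two copies must be separated by whitespace,
# and fullmatch anchors the pattern at both ends of the string.
REPEAT = re.compile(r'(.+?)\s+\1', flags=re.I)


def is_repeated(text: str) -> int:
    """Return 1 if text consists of the same chunk twice, separated by whitespace."""
    return int(REPEAT.fullmatch(text.strip()) is not None)


my_list = [
    "do you want pizza for dinner? Do you want pizza for dinner?",
    "I like pizza",
    "I have no money I have no money",
]

print([is_repeated(s) for s in my_list])  # [1, 0, 1]
print(bool(ORIGINAL.match("haha")))       # True
print(is_repeated("haha"))                # 0
```

With a DataFrame, the same function could be applied as df['Text'].apply(is_repeated).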