Tags: python, algorithm, nlp

How to find common words, sentences, or paragraphs across multiple paragraphs


I have the following sample paragraphs:

para1 = "this is para one. I am cat. I am 10 years old. I like fish"
para2 = "this is para two. I am dog. my age is 12. I can swim"
para3 = "this is para three. I am cat. I am 9 years. I like rat"
para4 = "this is para four. I am rat. my age is secret. I hate cat"
para5 = "this is para five. I am dog. I am 10 years old. I like fish"

I need results like the following:

this is para

I am

I 

I have tried Python's set data type, but the results were not what I wanted.
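
The set-based attempt is not shown in the question. As a minimal sketch of what such an attempt might look like (the names `paragraphs`, `word_sets`, and `common_words` are assumptions for illustration), intersecting per-paragraph word sets finds the shared words but loses their order, so a phrase such as "this is para" cannot be recovered:

```python
paragraphs = [
    "this is para one. I am cat. I am 10 years old. I like fish",
    "this is para two. I am dog. my age is 12. I can swim",
    "this is para three. I am cat. I am 9 years. I like rat",
    "this is para four. I am rat. my age is secret. I hate cat",
    "this is para five. I am dog. I am 10 years old. I like fish",
]

# One set of whitespace-separated words per paragraph (punctuation stays attached).
word_sets = [set(p.split()) for p in paragraphs]

# Words that occur in every paragraph; order and adjacency are lost.
common_words = set.intersection(*word_sets)
print(sorted(common_words))  # ['I', 'am', 'is', 'para', 'this']
```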

Is there an existing command-line program (a binary executable) that I could use to do this?


Solution

  • Hi, you can do something like the following:

    paragraph_lst = [
        "this is para one. I am cat. I am 10 years old. I like fish",
        "this is para two. I am dog. my age is 12. I can swim",
        "this is para three. I am cat. I am 9 years. I like rat",
        "this is para four. I am rat. my age is secret. I hate cat",
        "this is para five. I am dog. I am 10 years old. I like fish",
    ]

    # Phrases shared by at least one pair of adjacent paragraphs.
    word_combinations = set()


    def get_combinations(line1, line2, first=0, last=1, prvs_wrd=""):
        """Grow a run of words from line1 for as long as it still occurs in line2."""
        line_lst = line1.split(" ")
        if last > len(line_lst):
            # End of line1 reached; a run that is still growing here is not recorded.
            return
        wrd = " ".join(line_lst[first:last])
        if wrd in line2:
            # Substring test, so "rat" also matches inside "rat.".
            # The run still matches: remember it and try one more word.
            get_combinations(line1, line2, first, last + 1, wrd)
        else:
            # The run broke: record the last match (if any), then restart
            # at the word after the one that broke it.
            if prvs_wrd:
                word_combinations.add(prvs_wrd)
            get_combinations(line1, line2, last, last + 1, prvs_wrd)


    if __name__ == '__main__':
        # Compare each paragraph with the one that follows it.
        for str1, str2 in zip(paragraph_lst, paragraph_lst[1:]):
            get_combinations(str1, str2)
        print(word_combinations)
    

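    The set word_combinations then holds the result below (set order may vary). Each paragraph is compared only with the one after it, so a phrase is reported when any adjacent pair shares it, not only when all five paragraphs do: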

    {'I', 'I am', 'is', 'this is para'}
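
The code above compares adjacent paragraphs only. If the goal is phrases that occur in every paragraph, one alternative (a sketch, not the answerer's method) is to intersect sets of word n-grams and keep only the longest shared runs. The helper names `ngrams`, `contains`, and `common_phrases` are hypothetical, and tokens are assumed to be whitespace-separated with punctuation left attached:

```python
def ngrams(tokens, n):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contains(longer, shorter):
    """True if `shorter` appears as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def common_phrases(paragraphs):
    """Return the maximal word runs shared by every paragraph, longest first."""
    token_lists = [p.split() for p in paragraphs]
    shortest = min(len(tokens) for tokens in token_lists)
    common = set()
    for n in range(1, shortest + 1):
        shared = set.intersection(*(ngrams(tokens, n) for tokens in token_lists))
        if not shared:
            break  # no shared run of length n, so none of length n + 1 either
        common |= shared
    # Drop runs that are contained in a longer shared run.
    maximal = [
        g for g in common
        if not any(len(h) > len(g) and contains(h, g) for h in common)
    ]
    phrases = [" ".join(g) for g in maximal]
    return sorted(phrases, key=lambda s: (-len(s.split()), s))


paragraphs = [
    "this is para one. I am cat. I am 10 years old. I like fish",
    "this is para two. I am dog. my age is 12. I can swim",
    "this is para three. I am cat. I am 9 years. I like rat",
    "this is para four. I am rat. my age is secret. I hate cat",
    "this is para five. I am dog. I am 10 years old. I like fish",
]
result = common_phrases(paragraphs)
print(result)  # ['this is para', 'I am']
```

Because only maximal runs are kept, a standalone "I" is not reported here: it is part of the longer shared run "I am". Matching whole tokens instead of substrings also means "rat" no longer matches "rat.".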