python string match pseudocode

How to find the longest match of a string including a focus word in Python


I'm new to Python and programming, so I'm not quite sure how to phrase this.

What I want to do is this: take an input sentence, find every match between it and a set of stored sentences/strings, and return the longest combination of the matched strings.

I think the answer will have something to do with regex, but I haven't started learning those yet and didn't want to if I didn't need to.

My question: is regex the way to go about this, or is there a way to do it without importing anything?

If it helps you understand my question/idea, here's pseudocode for what I'm trying to do:

input = 'i play soccer and eat pizza on the weekends'
focus_word = 'and'

ss = [
      'i play soccer and baseball',
      'i eat pizza and apples',
      'every day i walk to school and eat pizza for lunch',
      'i play soccer but eat pizza on the weekend',
     ]

match = MatchingFunction(input, focus_word, ss)
# input should match with all except ss[3]

# ss[0] match: 'i play soccer and'
# ss[1] match: 'and'
# ss[2] match: 'and eat pizza'

# the returned value, match, should be 'i play soccer and eat pizza'

Solution

  • It sounds like you want to find the longest common substring between your input string and each string in your database. Assuming you have a function LCS that will find the longest common substring of two strings, you could do something like:

    > [LCS(input, s) for s in ss]
    ['i play soccer and ',
     ' eat pizza ',
     ' and eat pizza ',
     ' eat pizza on the weekend']
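
The snippet above assumes that an LCS function already exists. A minimal sketch of one, using plain dynamic programming and no imports, might look like the following. Only the name LCS comes from the answer; the body is an assumption and not necessarily what the answerer had in mind.

```python
def LCS(a, b):
    """Return the longest common substring of a and b.

    On ties, the match that ends earliest in `a` is returned.
    """
    best_len = 0  # length of the best match found so far
    best_end = 0  # index in `a` just past the end of that match

    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the common run that ended at the previous characters
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr

    return a[best_end - best_len:best_end]
```

This runs in time proportional to len(a) * len(b), which should be fine for sentence-sized strings. With the question's data, `[LCS(input, s) for s in ss]` should reproduce the list shown above, leading and trailing spaces included.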
    

    Then, it sounds like you're looking for the most-repeated substring within your list of strings. (Correct me if I'm wrong, but I'm not quite sure what you're looking for in the general case!) From the array output above, what combination of strings would you use to create your output string?


    Based on your comments, I think this should do the trick:

    > parts = [p for p in (LCS(input, s) for s in ss) if focus_word in p]
    > parts
    ['i play soccer and ', ' and eat pizza ']
    

    Then, to get rid of the repeated focus word when joining the parts in this example:

    > "".join([parts[0]] + [p.replace(focus_word, "").strip() for p in parts[1:]])
    'i play soccer and eat pizza'
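
The join above is tailored to this example: with more than two parts, the pieces would be glued together without spaces. One possible generalization, offered only as a sketch, is to remember where each matched part sits in the input sentence and return the overall span they cover. This is a different technique from the replace-and-join step above, and it assumes the kept parts all overlap at the focus word. The names matching_function, sentence, and stored are hypothetical. The longest common substring is found here with the standard library's difflib.SequenceMatcher; a hand-written function would work just as well if imports are unwanted.

```python
from difflib import SequenceMatcher


def matching_function(sentence, focus_word, stored):
    """Return the stretch of `sentence` covered by all matches that contain focus_word.

    Hypothetical helper: assumes the kept matches overlap at the focus word.
    """
    spans = []
    for s in stored:
        # autojunk=False turns off difflib's "popular element" heuristic
        matcher = SequenceMatcher(None, sentence, s, autojunk=False)
        m = matcher.find_longest_match(0, len(sentence), 0, len(s))
        part = sentence[m.a:m.a + m.size]
        # plain substring test, so 'and' would also match inside 'sandwich'
        if focus_word in part:
            spans.append((m.a, m.a + m.size))

    if not spans:
        return ""

    start = min(lo for lo, _ in spans)
    end = max(hi for _, hi in spans)
    return sentence[start:end].strip()
```

Called with the question's input, focus word, and list, this should return 'i play soccer and eat pizza'. If no stored string yields a match containing the focus word, it returns an empty string.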