python nltk

Compare strings, find part that is present in each string


How do I compare several strings and find the word or combination of words that is present in every one of them? Pure Python, NLTK, or anything else is fine.

few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')
# some magic
result = 'foo bar'

Solution

  • You can use difflib from the standard library, which compares sequences and can find the longest substring two of them share:

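    from difflib import SequenceMatcher
    
    list_of_str = ['this is foo bar', 'this is not a foo bar', 'some other foo bar here']
    
    # Narrow the candidate down by matching it against each remaining string in turn.
    # (Calling find_longest_match() without arguments requires Python 3.9+.)
    result = list_of_str[0]
    for next_string in list_of_str[1:]:
        match = SequenceMatcher(None, result, next_string).find_longest_match()
        result = result[match.a:match.a + match.size]
    
    # Caveat: this greedy reduction depends on string order and tie-breaking.
    # Here 'this is ' and ' foo bar' tie in the first comparison, the earlier block
    # wins, and result ends up as 'th' rather than 'foo bar'.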
    
    # With only two strings, a single call is enough:
    from difflib import SequenceMatcher
    
    string1 = "apple pie available"
    string2 = "come have some apple pies"
    
    match = SequenceMatcher(None, string1, string2).find_longest_match()
    
    print(match)  # -> Match(a=0, b=15, size=9)
    print(string1[match.a:match.a + match.size])  # -> apple pie
    print(string2[match.b:match.b + match.size])  # -> apple pie
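
The pairwise loop in the first snippet keeps only one block per comparison, so it can settle on the wrong one when two matches tie in length. As a rough alternative, here is a minimal sketch that works on whole words: it tries every run of consecutive words from the shortest string, longest first, and returns the first run found in all strings. The names contains_run and longest_common_phrase are hypothetical helpers, not part of difflib or NLTK, and splitting on whitespace is an assumption about what counts as a word.

```python
def contains_run(words, run):
    """Return True if `run` occurs as consecutive items inside `words`."""
    n = len(run)
    return any(words[i:i + n] == run for i in range(len(words) - n + 1))


def longest_common_phrase(strings):
    """Return the longest run of words present in every string ('' if none)."""
    # Assumption: words are whatever str.split() yields (whitespace-separated).
    token_lists = [s.split() for s in strings]
    if not token_lists:
        return ""
    shortest = min(token_lists, key=len)
    # Try candidate runs from longest to shortest; the first hit is the answer.
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            run = shortest[start:start + size]
            if all(contains_run(words, run) for words in token_lists):
                return " ".join(run)
    return ""


few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')
print(longest_common_phrase(few_strings))  # -> foo bar
```

This is brute force, so it suits a handful of short strings; for long texts a suffix-based structure would likely scale better.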