python, string, levenshtein-distance

How to find out if a string contains a substring or something similar to it


There are two strings: str1 is a pattern, str2 is a long text.

str1 = 'how to do this weird task'
str2 = 'once upon a time...and smth long'

How can I find out whether str2 contains str1 or something similar to it (not necessarily equal to str1)?

Right now I use Levenshtein.ratio with a sliding window of length len(str1) over str2:

from Levenshtein import ratio
res = [[str2[i:i+len(str1)], str1, ratio(str2[i:i+len(str1)], str1)] for i in range(len(str2) - len(str1) + 1)]

and then take the entry with the highest ratio, e.g. max(res, key=lambda r: r[2]). But maybe something better already exists?


Solution

  • You can use tokenization, something like this:

    from fuzzywuzzy import fuzz
    import re
    
    str1 = 'how to do this weird task'
    str2 = 'Once upon a time, there was a person who wanted to know how to accomplish this weird task.'
    
    str1 = str1.lower()
    str2 = str2.lower()
    
    pattern_words = re.findall(r'\w+', str1)
    
    best_match = None
    best_ratio = 0
    # Tokenize the text once, then slide a window of len(pattern_words) words over it
    text_words = re.findall(r'\w+', str2)
    for i in range(len(text_words) - len(pattern_words) + 1):
        window = text_words[i:i + len(pattern_words)]
        ratios = [fuzz.ratio(w, t) for w, t in zip(pattern_words, window)]
        avg_ratio = sum(ratios) / len(ratios)
        if avg_ratio > best_ratio:
            best_match = ' '.join(window)
            best_ratio = avg_ratio
    
    threshold_ratio = 80
    if best_ratio >= threshold_ratio:
        print(f"Found a match: '{best_match}' (ratio={best_ratio})")
    else:
        print("No match found")
    

    Output:

    Found a match: 'how to accomplish this weird task' (ratio=86.16666666666667)
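
If installing fuzzywuzzy is not an option, roughly the same word-window idea can be sketched with only the standard library. This is a minimal sketch, not the answer's method: `best_fuzzy_window` is a hypothetical helper name, and it scores each window as one joined string with `difflib.SequenceMatcher` instead of averaging per-word `fuzz.ratio` values. Its scores are therefore on a 0 to 1 scale and are not comparable to the ratio printed above.

```python
from difflib import SequenceMatcher
import re


def best_fuzzy_window(pattern, text):
    """Return (window, score) for the word window in text most similar to pattern.

    Hypothetical helper for illustration; score is a float in [0, 1].
    Returns (None, 0.0) if text has fewer words than pattern.
    """
    pattern_words = re.findall(r'\w+', pattern.lower())
    text_words = re.findall(r'\w+', text.lower())
    target = ' '.join(pattern_words)

    best_window, best_score = None, 0.0
    # Slide a window of len(pattern_words) words over the tokenized text
    for i in range(len(text_words) - len(pattern_words) + 1):
        window = ' '.join(text_words[i:i + len(pattern_words)])
        score = SequenceMatcher(None, target, window).ratio()
        if score > best_score:
            best_window, best_score = window, score
    return best_window, best_score


str1 = 'how to do this weird task'
str2 = 'Once upon a time, there was a person who wanted to know how to accomplish this weird task.'
match, score = best_fuzzy_window(str1, str2)
print(match, round(score, 2))
```

On the example strings this picks the same window, 'how to accomplish this weird task'. Any acceptance threshold would need to be re-tuned for this scoring, since it is not the same measure as the averaged `fuzz.ratio`.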