There are two strings: str1 is a pattern and str2 is a long text.
str1 = 'how to do this weird task'
str2 = 'once upon a time...and smth long'
How can I find out whether str2 contains str1 or something similar to it, not necessarily equal to str1?
Currently I use Levenshtein.ratio over a sliding window of length len(str1) moved along str2:
res = [[str2[i:i+len(str1)], str1, ratio(str2[i:i+len(str1)], str1)] for i in range(len(str2)-len(str1)+1)]
and choose the entry with the maximum ratio (the third element of each row in res), but maybe something better already exists?
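For reference, the sliding-window approach described in the question can be sketched with only the standard library. This is a rough illustration, not the asker's exact code: difflib.SequenceMatcher stands in for Levenshtein.ratio (an assumption; the two scores behave similarly but are not identical), and best_window and the sample text are made up for the example.

```python
from difflib import SequenceMatcher


def best_window(pattern, text):
    """Return (substring, score) for the character window of len(pattern)
    in text that is most similar to pattern. Hypothetical helper."""
    n = len(pattern)
    best_sub, best_score = "", 0.0
    # The +1 includes the final window that ends at the last character.
    for i in range(max(len(text) - n, 0) + 1):
        window = text[i:i + n]
        # SequenceMatcher.ratio() is in [0, 1]; assumed stand-in for Levenshtein.ratio
        score = SequenceMatcher(None, window, pattern).ratio()
        if score > best_score:
            best_sub, best_score = window, score
    return best_sub, best_score


str1 = 'how to do this weird task'
# Illustrative text with a typo ("wierd"), not taken from the question
str2 = 'once upon a time someone asked how to do this wierd task quickly'
match, score = best_window(str1, str2)
print(repr(match), round(score, 2))
```

Because the window has a fixed character length, this handles typos well but penalizes matches where a word is replaced by a much longer or shorter one.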
You can use tokenization and compare word by word, something like this:
from fuzzywuzzy import fuzz
import re

str1 = 'how to do this weird task'
str2 = 'Once upon a time, there was a person who wanted to know how to accomplish this weird task.'

# Normalize case so the comparison is case-insensitive.
str1 = str1.lower()
str2 = str2.lower()

# Tokenize both strings into words once.
pattern_words = re.findall(r'\w+', str1)
text_words = re.findall(r'\w+', str2)
n = len(pattern_words)

best_match = None
best_ratio = 0

# Slide a window of n words over the text and compare it to the pattern word by word.
for i in range(len(text_words) - n + 1):
    window = text_words[i:i + n]
    ratios = [fuzz.ratio(p, w) for p, w in zip(pattern_words, window)]
    avg_ratio = sum(ratios) / len(ratios)
    if avg_ratio > best_ratio:
        best_match = ' '.join(window)
        best_ratio = avg_ratio

# Accept the best window only if its average similarity is high enough.
threshold_ratio = 80
if best_ratio >= threshold_ratio:
    print(f"Found a match: '{best_match}' (ratio={best_ratio})")
else:
    print("No match found")
Output:
Found a match: 'how to accomplish this weird task' (ratio=86.16666666666667)
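To see where that ratio comes from, the per-word scores for the winning window can be reproduced by hand. The sketch below uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio (an assumption; word_score is a hypothetical helper that rounds to a 0-100 integer the way fuzz.ratio reports its result).

```python
from difflib import SequenceMatcher


def word_score(a, b):
    # 0-100 similarity between two words; assumed equivalent of fuzz.ratio
    return round(100 * SequenceMatcher(None, a, b).ratio())


pattern = 'how to do this weird task'.split()
window = 'how to accomplish this weird task'.split()

# One score per aligned word pair, then the plain average.
scores = [word_score(p, w) for p, w in zip(pattern, window)]
average = sum(scores) / len(scores)
print(scores, average)
# [100, 100, 17, 100, 100, 100] 86.16666666666667
```

Five words match exactly, while "do" and "accomplish" share only one letter and score 17, so the average is (5 * 100 + 17) / 6. With a six-word pattern, one fully replaced word still clears the threshold of 80, but two would not.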