Given two texts, for example "Cocacola" and "Coca-cola", is there a way, generalizable to other brands and other texts, to detect whether the two texts refer to the same brand?
Right now I have this simple code:
def matches_company_name(name1: str, name2: str):
    name1_lower_case = name1.lower()
    name2_lower_case = name2.lower()
    return (
        name1_lower_case in name2_lower_case
        or name2_lower_case in name1_lower_case
    )
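Note that the substring check above returns False for the "Cocacola" / "Coca-cola" pair, because the hyphen breaks the match in both directions. One possible heuristic is to strip punctuation and whitespace before comparing. The following is only a minimal sketch under that assumption; the helper names normalize_name and matches_company_name_normalized are hypothetical and not part of the original code.

```python
def normalize_name(name: str) -> str:
    """Case-fold the name and keep only letters and digits."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def matches_company_name_normalized(name1: str, name2: str) -> bool:
    # Hypothetical variant of matches_company_name that compares
    # normalized forms instead of the raw lowercase strings.
    n1, n2 = normalize_name(name1), normalize_name(name2)
    # An empty string is a substring of everything, so reject it explicitly.
    if not n1 or not n2:
        return False
    return n1 in n2 or n2 in n1


print(matches_company_name_normalized("Cocacola", "Coca-cola"))  # True
```

Dropping every non-alphanumeric character is aggressive: it also merges names that differ only by spacing or punctuation, which may or may not be what you want for a given set of brands.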
I have a couple ideas to add to the possible tests:
Is there already a way to achieve this? Do you have any good ideas (heuristics) to add to the tests?
I ended up using the fuzzywuzzy package, specifically partial_token_set_ratio, which gave very good results.
from fuzzywuzzy import fuzz

# fuzz ratios are integers from 0 to 100, so the threshold uses that scale.
MIN_SCORE = 70
s1 = "cocacola"
s2 = "coca-cola"
is_same_company = fuzz.partial_token_set_ratio(s1, s2) > MIN_SCORE
print(is_same_company)
# True
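If adding a dependency is not an option, a rough stand-in can be sketched with the standard library's difflib. This is a different technique from partial_token_set_ratio (plain sequence similarity, with no token handling), so the scores are not comparable. The function name similarity and the threshold MIN_RATIO are assumptions made for illustration. Keep in mind that difflib's ratio() is on a 0 to 1 scale, whereas the fuzzywuzzy ratios are integers from 0 to 100.

```python
from difflib import SequenceMatcher


def similarity(s1: str, s2: str) -> float:
    """Return a similarity score in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()


# Assumed threshold on difflib's 0-1 scale; tune it on real brand names.
MIN_RATIO = 0.7

print(similarity("cocacola", "coca-cola") > MIN_RATIO)  # True
```

Any threshold like this is a trade-off between false matches and missed matches, so it is worth checking it against a small labeled sample of names.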