python string-comparison

Is there a way to detect whether two texts refer to the same brand?


Given two texts, for example "Cocacola" and "Coca-cola", is there a way to detect whether they refer to the same brand that also generalizes to other brands and other texts?

Right now I have this simple code:

def matches_company_name(name1: str, name2: str) -> bool:
    # Case-insensitive check: is either name a substring of the other?
    name1_lower_case = name1.lower()
    name2_lower_case = name2.lower()
    return (
        name1_lower_case in name2_lower_case
        or name2_lower_case in name1_lower_case
    )
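
One limitation of the substring check is that punctuation breaks it: "cocacola" is not a substring of "coca-cola", so the example pair above does not match. A minimal sketch of one possible fix, assuming it is acceptable to drop everything except ASCII letters and digits before comparing (the helper names `normalize` and `matches_normalized` are hypothetical):

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except ASCII letters and digits.

    Assumption: accented letters are dropped too, so this is only a
    sketch for plain ASCII brand names.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())


def matches_normalized(name1: str, name2: str) -> bool:
    a, b = normalize(name1), normalize(name2)
    # An empty string is a substring of everything, so reject it explicitly.
    if not a or not b:
        return False
    return a in b or b in a


print(matches_normalized("Cocacola", "Coca-cola"))  # True
print(matches_normalized("Coca-Cola", "Pepsi"))     # False
```

This keeps the original idea but still matches short names inside unrelated longer ones, so it is probably best combined with other tests.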

I have a couple of ideas for additional tests:

  • Run named-entity recognition and compare the ORG entities (ORG is an entity label rather than a POS tag)
  • Split on spaces or dashes and check whether the names have parts in common (maybe too general)
  • Apply some kind of edit-distance threshold (this may fail for strings that differ a lot in length)
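
The second and third ideas can be sketched with the standard library alone. This is only an illustration under stated assumptions: `token_overlap` and `char_similarity` are hypothetical helper names, the token test uses Jaccard similarity, and `difflib.SequenceMatcher.ratio()` stands in for a true edit distance.

```python
import re
from difflib import SequenceMatcher


def tokens(name: str) -> set:
    """Split on anything that is not an ASCII letter or digit (spaces, dashes, ...)."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def token_overlap(name1: str, name2: str) -> float:
    """Jaccard similarity of the two token sets, between 0.0 and 1.0."""
    t1, t2 = tokens(name1), tokens(name2)
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)


def char_similarity(name1: str, name2: str) -> float:
    """difflib similarity ratio after removing separators; 1.0 means identical."""
    a = re.sub(r"[^a-z0-9]", "", name1.lower())
    b = re.sub(r"[^a-z0-9]", "", name2.lower())
    return SequenceMatcher(None, a, b).ratio()


# The token test alone misses this pair, the character test catches it.
print(token_overlap("Cocacola", "Coca-cola"))    # 0.0
print(char_similarity("Cocacola", "Coca-cola"))  # 1.0
```

Because the ratio is relative to the combined length of both strings, it drops when one name is much longer than the other, which is the failure mode noted in the last bullet. Any thresholds on these scores would need to be tuned on real name pairs.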

Is there already a way to achieve this? Do you have any good ideas (heuristics) to add to the tests?


Solution

  • I ended up using the fuzzywuzzy package, specifically partial_token_set_ratio, which gave very good results.

    Usage example:

    from fuzzywuzzy import fuzz
    
    
    MIN_SCORE = 70  # fuzz scores are integers from 0 to 100, so 0.7 would accept almost any pair
    s1 = "cocacola"
    s2 = "coca-cola"
    
    is_same_company = fuzz.partial_token_set_ratio(s1, s2) > MIN_SCORE
    print(is_same_company)
    # True