python, string, compare

Comparing strings while normalizing special characters in Python


My English could probably be better, but what I want is to ignore accents (and the like) in words, so:

renè, rené, rene' and rene should all be treated as the same, and so should

mañana and manana, or

even-distribuited and even distribuited, and possibly

shouldn't and shouldnt

I remember a function (derived from journalism) that is used, for example, for web page addresses and that strips out spaces, accents, etc., but I don't remember its name. I think it should work, but other approaches are welcome.

Thank you

Edit:

The function I had in mind is slugify() from Django, but it is probably not enough.


Solution

  • The standard approach for getting rid of special characters is discussed in this question. But you could also consider another approach, often called fuzzy matching (or fuzzy search):

    [...] technique of finding strings that match a pattern approximately (rather than exactly)

    In Python you can use TheFuzz to do that. Here is an attempt based on your examples.

    from thefuzz import fuzz

    pairs = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

    # fuzz.ratio returns a similarity score from 0 (nothing in common) to 100 (identical)
    for a, b in pairs:
        print(f"{a} vs {b}: {fuzz.ratio(a, b)}")

    # mañana vs manana: 83
    # shouldn't vs shouldnt: 94
    # even-distribuited vs even distribuited: 94
    

    You could then define a rule based on the ratio, for example a minimum score, to decide whether two strings match.
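
    As a rough sketch of such a rule, the snippet below treats two strings as equal when their score reaches a cutoff. The names lcs_length, similarity and is_match are hypothetical, the cutoff of 80 is an arbitrary assumption you would need to tune on your own data, and similarity is a dependency-free stand-in based on the longest common subsequence, not TheFuzz's own implementation. With TheFuzz installed you could call fuzz.ratio in its place.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score from 0 to 100: twice the LCS length over the combined length."""
    if not a and not b:
        return 100.0
    return 200.0 * lcs_length(a, b) / (len(a) + len(b))


def is_match(a: str, b: str, threshold: float = 80.0) -> bool:
    """Treat two strings as the same when their score reaches the threshold."""
    return similarity(a, b) >= threshold


print(is_match("mañana", "manana"))  # score is about 83, above the assumed cutoff
print(is_match("rene", "mañana"))    # only one letter in common, well below it
```

    A lower cutoff accepts more variants but also more false matches, so it is worth checking the rule against a few pairs that must not match.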


    You could even combine Unicode normalization and fuzzy matching for better results.

    import unicodedata

    from thefuzz import fuzz

    pairs = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

    def strip_accents(text):
        # Decompose accented characters (NFKD), then drop everything that is not ASCII
        return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

    def compare(pairs, normalize=True):
        for a, b in pairs:
            if normalize:
                a, b = strip_accents(a), strip_accents(b)
            print(f"{a} vs {b}: {fuzz.ratio(a, b)}")

    compare(pairs)

    # manana vs manana: 100
    # shouldn't vs shouldnt: 94
    # even-distribuited vs even distribuited: 94
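
    The accent-stripping step leaves apostrophes, hyphens and spaces in place, which is why the last two pairs still score 94. If an exact comparison is enough for your case, one possible sketch is to build a comparison key that also drops every character that is not a letter or digit. The name comparison_key is hypothetical, and the choice to ignore case via casefold() is an assumption that goes beyond what the question asks.

```python
import unicodedata


def comparison_key(text: str) -> str:
    """Reduce a string to lowercase base letters and digits for comparison."""
    # NFKD splits an accented character into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # isalnum() is False for combining marks, spaces, hyphens and apostrophes,
    # so all of them are dropped here.
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()


examples = [
    ("renè", "rene'"),
    ("mañana", "manana"),
    ("even-distribuited", "even distribuited"),
    ("shouldn't", "shouldnt"),
]
for left, right in examples:
    print(left, right, comparison_key(left) == comparison_key(right))
```

    Note that a key like this is aggressive: it would also make "its" and "it's" compare equal, so it suits lookups and deduplication better than anything where punctuation carries meaning.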