My English could probably be better, but what I want is to ignore accents (and the like) in words, so that renè, rené, rene' and rene are treated as the same word, as are mañana and manana, or even-distribuited and even distribuited, and possibly shouldn't and shouldnt.
I remember a function (derived from journalism) that is used, for example, for web page addresses and that strips spaces, accents, etc., but I don't remember its name. I think it should work, but other approaches are welcome.
Thank you
Edit:
The function I had in mind is slugify() from Django, but it is probably not enough.
The standard approach to get rid of special characters seems to be discussed in this question. But maybe you could consider another approach, often called fuzzy matching (or fuzzy search).
[...] technique of finding strings that match a pattern approximately (rather than exactly)
In Python you can use TheFuzz to do that. Here is a try based on your examples.
from thefuzz import fuzz

tuples = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

# Print the similarity score (0 to 100) for each pair of strings.
for pair in tuples:
    print(f"{pair[0]} vs {pair[1]}: {fuzz.ratio(pair[0], pair[1])}")
# mañana vs manana: 83
# shouldn't vs shouldnt: 94
# even-distribuited vs even distribuited: 94
So you could define a rule based on the ratio to conclude that there is a match between two strings.
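Such a rule can be as simple as a threshold on the score. The sketch below is only an illustration: `similarity` is a hypothetical, dependency-free stand-in for `fuzz.ratio` (it scores two strings by their longest common subsequence), and the threshold of 80 is an arbitrary assumption that you would need to tune on your own data.

```python
def lcs_length(a, b):
    # Longest common subsequence length, computed row by row.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    # Hypothetical stand-in for fuzz.ratio: score from 0 to 100.
    if not a and not b:
        return 100
    return round(200 * lcs_length(a, b) / (len(a) + len(b)))


def is_match(a, b, threshold=80):
    # Assumption: 80 is an arbitrary cut-off, not a recommended value.
    return similarity(a, b) >= threshold


print(is_match("mañana", "manana"))  # True  (score 83)
print(is_match("mañana", "banana"))  # False (score 67)
```

A lower threshold accepts more variants but also more false positives, so it is worth checking the rule against pairs that should not match.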
You could even combine unicode normalization and fuzzy matching for better results.
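import unicodedata
from thefuzz import fuzz

tuples = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

def strip_accents(text):
    # Decompose accented characters (NFKD), then drop everything that is not ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def compare(tuples, normalize=True):
    for t in tuples:
        if normalize:
            t = tuple(strip_accents(s) for s in t)
        print(f"{t[0]} vs {t[1]}: {fuzz.ratio(t[0], t[1])}")

compare(tuples)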
# manana vs manana: 100
# shouldn't vs shouldnt: 94
# even-distribuited vs even distribuited: 94
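If the variations are limited to the kinds listed in the question (accents, apostrophes, hyphens), an exact comparison on a normalized key may already be enough. Here is a minimal sketch using only the standard library; `match_key` is a hypothetical helper name, and treating apostrophes as noise and hyphens as spaces is an assumption taken from the question's examples.

```python
import unicodedata


def match_key(text):
    # Decompose characters so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, which removes the accents.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: apostrophes are noise and hyphens behave like spaces.
    cleaned = no_accents.replace("'", "").replace("\u2019", "").replace("-", " ")
    # Ignore case and collapse repeated whitespace.
    return " ".join(cleaned.casefold().split())


words = ["renè", "rené", "rene'", "rene"]
print({match_key(w) for w in words})  # a single key: {'rene'}
```

Two strings are then considered the same when their keys are equal, for example `match_key("shouldn't") == match_key("shouldnt")`.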