php regex encoding special-characters

Regex, encoding, and characters that look alike


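First, a brief example. Say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously... or is it obvious? Really, it depends on the font: ° (degree sign) and º (masculine ordinal indicator) are different characters that can look nearly identical.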
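
To make the mismatch concrete, here is a minimal sketch in Python (chosen only because its standard unicodedata module can print character names; the helper name describe is hypothetical). It checks both texts against the pattern and lists the code point and Unicode name of every non-ASCII character, which is also a quick way to identify a newly discovered look-alike.

```python
import re
import unicodedata


def describe(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# The pattern expects U+00B0 DEGREE SIGN after two digits.
pattern = re.compile("[0-9]{2}\u00b0")

print(pattern.search("24\u00b0") is not None)  # True: real degree sign
print(pattern.search("24\u00ba") is not None)  # False: masculine ordinal indicator
print(describe("24\u00ba"))
```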

Here is my problem: I have no control over which characters the user types, so I need either to cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, better, to ensure the text contains only the characters I expect (°). But I can't simply strip the unknown characters, or the regex won't match; I need to replace each one with the look-alike character I expect. I do this with a small function that maps each "look-alike" to "what I expect". The problem is that I haven't covered every possibility. For example, today I found a new -, so now we have three of them, just like LaTeX =D (- -- ---). Cool, but the regex didn't match.

Does anyone know how I might solve this?


Solution

  • I just stumbled upon some good references for this question:

    http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt

    https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize

    https://www.rfc-editor.org/rfc/rfc3454.html
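
As a rough illustration of how the unicodedata reference above could be applied, here is a minimal Python sketch, not a complete solution. The CONFUSABLES table and the canonicalize function are hypothetical names, and the table is deliberately tiny: it only covers look-alikes that normalization itself does not unify. After the explicit map, NFKC normalization folds compatibility variants (for example full-width digits), and every character in the Unicode "dash punctuation" category (Pd) is folded to a plain hyphen-minus, so a newly encountered dash needs no new table entry.

```python
import re
import unicodedata

# Hypothetical, hand-maintained map for look-alikes that NFKC does not unify.
CONFUSABLES = {
    "\u00ba": "\u00b0",  # MASCULINE ORDINAL INDICATOR -> DEGREE SIGN
    "\u02da": "\u00b0",  # RING ABOVE -> DEGREE SIGN
    "\u2212": "-",       # MINUS SIGN (category Sm, not Pd) -> HYPHEN-MINUS
}


def canonicalize(text):
    """Map known look-alikes, apply NFKC, then fold all dashes to '-'."""
    # Explicit map first: NFKC would otherwise turn U+00BA into the letter "o".
    text = "".join(CONFUSABLES.get(ch, ch) for ch in text)
    text = unicodedata.normalize("NFKC", text)
    # Category "Pd" covers hyphen, en dash, em dash, and similar characters.
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch for ch in text
    )


pattern = re.compile("[0-9]{2}\u00b0")

print(pattern.fullmatch(canonicalize("24\u00ba")) is not None)              # True
print(pattern.fullmatch(canonicalize("\uff12\uff14\u00ba")) is not None)    # True
print(canonicalize("a\u2013b\u2014c"))                                      # a-b-c
```

The order matters: the explicit map runs before normalization because NFKC replaces º with a plain "o", after which it could no longer be told apart from an ordinary letter. In PHP, the intl extension's Normalizer class offers the same normalization forms, though the table and the dash folding would still need to be written by hand.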