I am trying to match all Latin characters in UTF-16 encoded text. I have been using [A-Za-z], which has been working great. But while parsing Chinese and Japanese text, I've been coming across bizarre versions of A-Z that the regex isn't picking up.
https://gist.github.com/kyleect/1c66fd388d362653969d
On the left are the characters I can't identify; on the right are the ones from my keyboard. I copied and pasted them into Chrome's page find input, Google search, and the find input in my text editor. All agree: Left == Right
but Right != Left
What are these characters, and how do I target them in regex?
You can take a look at their character codes in your browser’s console:
> 'Ｂ'.charCodeAt(0).toString(16)
ff22
It’s a fullwidth letter! You can probably match the whole set with [\uff21-\uff3a]
in a decent regex engine. Or [Ａ-Ｚ]
in an even more decent one.
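Note that the range above only covers the uppercase fullwidth letters (U+FF21 to U+FF3A); the lowercase ones sit at U+FF41 to U+FF5A. Here is a rough sketch of how you might fold both into your existing [A-Za-z] class, assuming a JavaScript engine. The sample string and variable names are made up for illustration. As an alternative, NFKC normalization should map fullwidth letters back to their ASCII equivalents, so your original regex works unchanged on the normalized text.

```javascript
// Hypothetical sample text: ASCII letters, fullwidth uppercase (U+FF21..U+FF23),
// fullwidth lowercase (U+FF41..U+FF43), and some Japanese.
const sample = 'ABC \uff21\uff22\uff23 \uff41\uff42\uff43 日本語 xyz';

// Option 1: extend the character class with both fullwidth ranges.
// U+FF21-U+FF3A is fullwidth A-Z, U+FF41-U+FF5A is fullwidth a-z.
const latinLetters = /[A-Za-z\uff21-\uff3a\uff41-\uff5a]+/g;
const matches = sample.match(latinLetters);
console.log(matches); // four runs: ASCII ABC, fullwidth ABC, fullwidth abc, xyz

// Option 2: normalize first, then reuse the plain ASCII class.
// NFKC replaces compatibility characters such as fullwidth letters
// with their ordinary counterparts.
const normalized = sample.normalize('NFKC');
const asciiMatches = normalized.match(/[A-Za-z]+/g);
console.log(asciiMatches); // [ 'ABC', 'ABC', 'abc', 'xyz' ]
```

Option 1 leaves the text untouched, which matters if you need the original characters or offsets. Option 2 is shorter, but NFKC also rewrites other compatibility characters (halfwidth katakana, for example), so it may change more of the text than you want.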