javascript regex unicode character-properties

How to use Unicode character groups in JavaScript's regexes?


Is there a way to use patterns like "\p{L}" in JavaScript, natively?

(I suppose that is a Perl-compatible syntax.)

I'm interested primarily in Firefox support, and possibly WebKit.


Solution

  • Unfortunately, no. You can only specify a set of characters in the usual syntax, writing characters and ranges in brackets, but this becomes awkward since e.g. letters are scattered all around the Unicode space, with other characters between them.

    There’s an inefficient workaround: fetch the UnicodeData.txt file from the Unicode site, put its content inside your JavaScript code as data, and parse it. You could then hold the data in, e.g., an array of objects containing the Unicode properties, such as gc (General Category), which tells you whether the character is a letter or not. But even then, you would just have the data handy for simple testing, not as something you can use as a constituent of a regexp.

    In theory, you could use the data to construct a regexp... but it would be rather large.
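
A rough sketch of that idea, for illustration only. Assumptions that are not part of the answer above: the data is a tiny inline excerpt in the UnicodeData.txt layout (not the full file), `collectRanges` and `buildClass` are hypothetical helper names, only BMP code points are handled, and the lines are assumed to be sorted by code point, as they are in the real file.

```javascript
// Illustrative excerpt in the UnicodeData.txt layout: semicolon-separated
// fields, field 0 = code point (hex), field 1 = name, field 2 = General Category.
const sampleData = [
  "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
  "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;",
  "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
  "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9",
  "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
  "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
].join("\n");

// Collect merged [start, end] code point ranges whose General Category
// starts with gcPrefix ("L" selects Lu, Ll, Lt, Lm, Lo). BMP only.
function collectRanges(text, gcPrefix) {
  const ranges = [];
  let pendingStart = null; // set between a "<..., First>" and "<..., Last>" line
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    const [hex, name, gc] = line.split(";");
    const cp = parseInt(hex, 16);
    if (cp > 0xffff) continue; // skip supplementary planes in this sketch
    if (name.endsWith(", First>")) {
      pendingStart = cp;
      continue;
    }
    // A "Last" line closes the range opened by the preceding "First" line.
    const start =
      name.endsWith(", Last>") && pendingStart !== null ? pendingStart : cp;
    pendingStart = null;
    if (!gc.startsWith(gcPrefix)) continue;
    const last = ranges[ranges.length - 1];
    if (last && start <= last[1] + 1) {
      last[1] = Math.max(last[1], cp); // extend an adjacent range
    } else {
      ranges.push([start, cp]);
    }
  }
  return ranges;
}

// Format a BMP code point as a \uXXXX escape for use in a regexp source.
const esc = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");

// Build a bracketed character class from the ranges.
function buildClass(ranges) {
  const body = ranges
    .map(([a, b]) => (a === b ? esc(a) : esc(a) + "-" + esc(b)))
    .join("");
  return new RegExp("[" + body + "]");
}

const letterRanges = collectRanges(sampleData, "L");
const letter = buildClass(letterRanges);

console.log(letter.source);
console.log(letter.test("A"), letter.test("0"));
```

The excerpt only knows about the few characters listed in it, so for example "c" would not match here. With the full file the same approach would produce a far longer character class, which is the size problem mentioned above.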