Tags: javascript, regex, unicode, regex-group, unicode-escapes

Include a Unicode character within a long regex


I have a regex:

/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġg̶̃čḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶]+/gm

which works well, except for one character that I can't include (or that doesn't work as expected when included). It is the last character sequence in the regex:

ś̶ // the stroke through the letter (hard to see in some fonts) is U+0336 COMBINING LONG STROKE OVERLAY

My regex captures the character but splits any word that contains it:

"mokk̇ś̶ḣô".match(/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġčḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶g̶̃]+/gm)

// == ['mokk', 'ś̶ḣô']

I've heard about Unicode property escapes, written \p{UnicodePropertyValue} and used with the u flag. Would they be useful here?


Solution

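  • It doesn't seem to be related to the ś character. As you said yourself, it is being captured. The split happens because another character is missing from the class: k̇. That sequence is the letter k followed by U+0307 COMBINING DOT ABOVE, and the combining mark is the part the class fails to match.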

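    // k̇ is missing from the class, so the match splits after "mokk"
    console.log("mokk̇ś̶ḣô".match(/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġčḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶g̶̃]+/gm));
    // k̇ added to the class, so the whole word matches
    console.log("mokk̇ś̶ḣô".match(/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġčḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶k̇g̶̃]+/gm));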
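
A character class matches one code point at a time, so a letter written as base plus combining mark only matches when both code points are in the class. The sketch below makes that visible with explicit escapes and then tries the Unicode property escapes mentioned in the question. The helper `codePoints`, the variable names, and the shortened character classes are illustrative and not from the original post. The test word is assumed to use precomposed ḣ (U+1E23) and ô (U+00F4) and a decomposed ś̶ (s + U+0301 + U+0336); the real string may be encoded differently.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// k with a dot above has no precomposed form: it is k + COMBINING DOT ABOVE.
console.log(codePoints("k\u0307")); // ["U+006B", "U+0307"]

// The word from the question, spelled with escapes (encoding is an assumption).
const word = "mokk\u0307s\u0301\u0336\u1E23\u00F4";

// Shortened stand-in for the long class: U+0307 is absent, so the match splits.
const withoutDot = word.match(/[a-z\u0301\u0336\u1E23\u00F4]+/g);
console.log(withoutDot.length); // 2

// Adding U+0307 to the class keeps the word in one piece.
const withDot = word.match(/[a-z\u0301\u0307\u0336\u1E23\u00F4]+/g);
console.log(withDot.length); // 1

// Property escapes (u flag required): any letter or any combining mark.
const byProperty = word.match(/[\p{L}\p{M}]+/gu);
console.log(byProperty.length); // 1
```

Note that `[\p{L}\p{M}]` accepts every letter and every combining mark in Unicode, which is much broader than the hand-written list. If only specific characters should be allowed, listing the needed combining marks by escape (`\u0307`, `\u0336`, and so on) keeps the class strict while making it obvious which marks are covered.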