javascript regex unicode emoji

Why do Unicode emoji property escapes match numbers?


I found this awesome way to detect emojis using a regex that doesn't use "huge magic ranges" by using a Unicode property escape:

console.log(/\p{Emoji}/u.test('flowers 🌼🌺🌸')) // true
console.log(/\p{Emoji}/u.test('flowers')) // false

But when I shared this knowledge in this answer, @Bronzdragon noticed that \p{Emoji} also matches numbers! Why is that? Numbers are not emojis?

console.log(/\p{Emoji}/u.test('flowers 123')) // unexpectedly true

// regex-only workaround by @Bronzdragon
const regex = /(?=\p{Emoji})(?!\p{Number})/u;
console.log(
  regex.test('flowers'), // false, as expected
  regex.test('flowers 123'), // false, as expected
  regex.test('flowers 123 🌼🌺🌸'), // true, as expected
  regex.test('flowers 🌼🌺🌸'), // true, as expected
)

// more readable workaround
const hasEmoji = str => {
  const nbEmojiOrNumber = (str.match(/\p{Emoji}/gu) || []).length;
  const nbNumber = (str.match(/\p{Number}/gu) || []).length;
  return nbEmojiOrNumber > nbNumber;
}
console.log(
  hasEmoji('flowers'), // false, as expected
  hasEmoji('flowers 123'), // false, as expected
  hasEmoji('flowers 123 🌼🌺🌸'), // true, as expected
  hasEmoji('flowers 🌼🌺🌸'), // true, as expected
)


Solution

  • NOTE: In contemporary JavaScript (engines that support the v regex flag, introduced in ES2024), you can match any emoji, including multi-code-point sequences, with \p{RGI_Emoji}:

    // EXTRACT:
    console.log( 'flowers 🌼🌺🌸'.match(/\p{RGI_Emoji}/vg) ); // => ['🌼', '🌺', '🌸']
    // TEST IF PRESENT:
    console.log( /\p{RGI_Emoji}/v.test('flowers 🌼🌺🌸') ); // => true
    // COUNT:
    console.log( 'flowers 🌼🌺🌸'.match(/\p{RGI_Emoji}/vg).length ); // => 3
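
Since a /…/v literal is a syntax error in engines that predate the v flag, here is a minimal sketch of a feature-detecting variant. The helper names buildEmojiRegex and containsEmoji are hypothetical, and the \p{Extended_Pictographic} fallback is my own assumption rather than an equivalent of \p{RGI_Emoji}: it excludes digits, # and *, but it should also miss flag and keycap sequences.

```javascript
// Hypothetical helper: build the emoji regex once, preferring the `v` flag.
function buildEmojiRegex() {
  try {
    // The RegExp constructor keeps this file parseable in engines that
    // would reject a /.../v literal at parse time.
    return new RegExp('\\p{RGI_Emoji}', 'v');
  } catch {
    // Fallback (an assumption, not an exact equivalent of RGI_Emoji):
    // Extended_Pictographic does not include digits, '#' or '*'.
    return new RegExp('\\p{Extended_Pictographic}', 'u');
  }
}

// No `g` flag, so test() keeps no state between calls.
const emojiRegex = buildEmojiRegex();
const containsEmoji = (str) => emojiRegex.test(str);

console.log(containsEmoji('flowers 🌼🌺🌸')); // true
console.log(containsEmoji('flowers 123'));    // false
console.log(containsEmoji('flowers'));        // false
```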
    

    The answer to the current question

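    According to this post, digits, #, * and some other characters have the Emoji property set to Yes, which is why \p{Emoji} matches them. They are also listed in the Unicode emoji data as Emoji_Component, alongside ZWJ and others, because they serve as building blocks of emoji sequences such as keycaps: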

    0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
    002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
    0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
    200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
    20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
    FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
    1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
    1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
    1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
    E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag
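
To see which emoji-related properties a given character actually carries, a quick probe like the one below can help. It is only a sketch: hasProperty is a hypothetical helper, the sample characters are my own choice, and the results depend on the Unicode version your engine ships.

```javascript
// Characters to probe: a digit, '#', '*', ZWJ (U+200D) and a flower emoji.
const samples = ['1', '#', '*', '\u200D', '🌼'];

// Binary Unicode properties that JavaScript regexes accept with the `u` flag.
const properties = ['Emoji', 'Emoji_Component', 'Emoji_Presentation', 'Extended_Pictographic'];

// Hypothetical helper: does the single character `ch` have property `prop`?
const hasProperty = (ch, prop) => new RegExp(`^\\p{${prop}}$`, 'u').test(ch);

for (const ch of samples) {
  const codePoint = 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  const matched = properties.filter((prop) => hasProperty(ch, prop));
  console.log(codePoint, matched.join(', '));
}
// Expected on a recent engine:
// U+0031 Emoji, Emoji_Component
// U+0023 Emoji, Emoji_Component
// U+002A Emoji, Emoji_Component
// U+200D Emoji_Component
// U+1F33C Emoji, Emoji_Presentation, Extended_Pictographic
```

So digits have Emoji=Yes but neither Emoji_Presentation nor Extended_Pictographic, which is why those two properties are sometimes used when only pictographic characters are wanted.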
    

    For example, 1 is a digit, but it becomes an emoji when followed by the U+FE0F and U+20E3 characters, giving 1️⃣:

    console.log("1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 0\uFE0F\u20E3")
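
To make the keycap case concrete, here is a small sketch that splits 1️⃣ into its code points and matches it as a whole sequence. The hand-written keycap pattern is my own assumption, based on the digit, #, * and U+20E3 rows of the table above; the \p{RGI_Emoji} check only runs where the v flag is available.

```javascript
const keycapOne = '1\uFE0F\u20E3'; // renders as 1️⃣

// The sequence is three code points: the digit, VS16 and the enclosing keycap.
const codePoints = [...keycapOne].map(
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(codePoints); // ['U+0031', 'U+FE0F', 'U+20E3']

// Hand-written keycap pattern (an assumption): base character + VS16 + keycap.
const keycap = /[0-9#*]\uFE0F\u20E3/u;
console.log(keycap.test(keycapOne));     // true
console.log(keycap.test('flowers 123')); // false

// Where the `v` flag is supported, \p{RGI_Emoji} treats the whole
// sequence as one emoji and rejects the bare digit.
try {
  const rgi = new RegExp('^\\p{RGI_Emoji}$', 'v');
  console.log(rgi.test(keycapOne), rgi.test('1')); // true false
} catch {
  console.log('the v flag is not supported in this engine');
}
```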