In PHP PCRE syntax, how does one specify a multi-codepoint Unicode character/"emoji"?


Code:

var_dump(preg_replace('#\x{1F634}#u', '', 'This is the sleeping emoji: 😴'));
var_dump(preg_replace('#\x{1F1FB 1F1F3}#u', '', 'This is the Vietnam flag: 🇻🇳'));

Expected output:

string(28) "This is the sleeping emoji: "
string(26) "This is the Vietnam flag: "

Actual output:

string(28) "This is the sleeping emoji: "
string(34) "This is the Vietnam flag: 🇻🇳"

Analysis:

The one-codepoint emoji is successfully removed, but the multi-codepoint emoji is not detected.

Research performed:

Read the following on: https://www.php.net/manual/en/regexp.reference.escape.php

After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

Unfortunately, it does not mention multi-codepoint Unicode characters.

Question:

How to specify a multi-codepoint emoji/Unicode character in PHP PCRE syntax?

Helpful note:

It is not a range! I am able to detect and remove ranges. This is a single emoji/Unicode character consisting of multiple "codepoints". There are quite a few of those specified here: https://www.unicode.org/Public/emoji/13.1/emoji-sequences.txt


Solution

  • The passage you quote says that \x{...} "is interpreted as a UTF-8 character". The wording is slightly misleading: each \x{...} escape denotes a single Unicode codepoint (encoded as UTF-8), not a whole user-perceived character. Since the flag consists of two codepoints, you need two such escapes, one per codepoint:

    var_dump(preg_replace('#\x{1F1FB}\x{1F1F3}#u', '', 'This is the Vietnam flag: 🇻🇳'));
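
    If you have many such sequences (for example, from emoji-sequences.txt), you could generate the escapes rather than typing them by hand. The following is only a sketch: codepointsToPattern() is a hypothetical helper name, not a built-in function, and the arrow function assumes PHP 7.4 or later. The subject string uses PHP's "\u{...}" escape so the example does not depend on the file's encoding.

```php
<?php
// Hypothetical helper (not part of PHP): turn a list of codepoints
// into a sequence of \x{...} escapes, one escape per codepoint.
function codepointsToPattern(array $codepoints): string
{
    $parts = array_map(
        fn (int $cp): string => sprintf('\x{%X}', $cp),
        $codepoints
    );
    return implode('', $parts);
}

// The Vietnam flag is two regional-indicator codepoints: U+1F1FB U+1F1F3.
$flag = codepointsToPattern([0x1F1FB, 0x1F1F3]); // \x{1F1FB}\x{1F1F3}

$subject = "This is the Vietnam flag: \u{1F1FB}\u{1F1F3}";
$result  = preg_replace('#' . $flag . '#u', '', $subject);

var_dump($result); // string(26) "This is the Vietnam flag: "
```

    The u modifier is still required, because without it PCRE does not accept \x{...} values above 0xFF.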