How to match ZWSP (zero-width space) encoded as UTF-8

I'm having trouble matching and replacing the ZWSP Unicode character when it is encoded as UTF-8.

ZWSP: \x20\x0B
ZWSP (UTF8): \xE2\x80\x8B

As an extra test case I have used NBSP (non-breaking space), which works as expected.

All preg_replace calls use UTF-8 mode (/u).

When matching NBSP, it works as expected: the input is encoded as UTF-8 and the output is empty (the NBSP is replaced with an empty string).
When matching ZWSP, it only works if the ZWSP input is not UTF-8 encoded.
If I change the ZWSP pattern to the UTF-8 encoded version and keep the input as UTF-8, it doesn't work either.

Q: So how do I match ZWSP in UTF-8?

... or is this a bug?

Code

$nbsp       = '\xA0'; // Non-breaking space
$zwsp       = '\x20\x0B'; // Zero-width space
$zwsp_utf8  = '\xE2\x80\x8B';

$input_nbsp_utf8    = "\xC2\xA0";
$input_zwsp         = "\x20\x0B";
$input_zwsp_utf8    = "\xE2\x80\x8B";

// NBSP
echo "NBSP\n-----\n";
echo "in: $input_nbsp_utf8--\nhex: ".bin2hex($input_nbsp_utf8)."\n";
$output = preg_replace('/'.$nbsp.'/u', '', $input_nbsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

// ZWSP (input: **not** UTF8)
echo "ZWSP (input: **not** UTF8)\n-----\n";
echo "in: $input_zwsp--\nhex: ".bin2hex($input_zwsp)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

// ZWSP (input: UTF8)
echo "ZWSP (input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

// ZWSP (pattern: UTF8, input: UTF8)
echo "ZWSP (pattern: UTF8, input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp_utf8.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

Output

NBSP
-----
in:  --
hex: c2a0
out: --
hex:

ZWSP (input: **not** UTF8)
-----
in:
     --
hex: 200b
out: --
hex:

ZWSP (input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty

ZWSP (pattern: UTF8, input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty

Solution

Like many people, you seem to be confused about what UTF-8 is. UTF-8 isn't a setting which is on or off; it is one of many different ways of turning text into binary data, and of interpreting that binary data to get the text back.

I'm not sure where \x20\x0B came from, or what it has to do with anything, but saying something is "not UTF-8" is like saying a word is "not French", or a piece of meat is "not chicken".
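As a guess at where those two bytes might have come from: they are the hex digits of the code point U+200B split into two bytes, which is how UTF-16BE stores that character. Read as ASCII or UTF-8, the same bytes are just a space followed by a vertical tab. A minimal sketch to check this, assuming the mbstring extension is loaded (the variable names are only illustrative):

```php
<?php
// The two bytes from the question, as a double-quoted PHP string.
$bytes = "\x20\x0B";

// Read as UTF-8 they are two separate characters: space (0x20) and vertical tab (0x0B).
echo mb_strlen($bytes, 'UTF-8'), " characters when read as UTF-8\n"; // 2

// Read as UTF-16BE they are the single code point U+200B; re-encode it as UTF-8.
$zwsp_utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
echo bin2hex($zwsp_utf8), "\n"; // e2808b
```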

Ignoring that part, let's look at the key piece of code:

$input_zwsp_utf8 = "\xE2\x80\x8B";
$output = preg_replace('/\xE2\x80\x8B/u', '', $input_zwsp_utf8);

You have provided the /u modifier, about which the manual says:

Pattern and subject strings are treated as UTF-8.

Then you've matched using the \xhh notation, which is described under escape sequences:

After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

This is a bit confusing, but it's saying that normally, \xE2 would match the binary byte E2, i.e. 11100010; but with /u active, it will instead match the Unicode code point U+00E2, which is "Latin Small Letter a With Circumflex".

Example:

$input = 'â';

echo "in: $input\nhex: ".bin2hex($input)."\n";
$output = preg_replace('/\xE2/u', '', $input);
echo "out: $output\nhex: ".bin2hex($output)."\n\n";

Output:

in: â
hex: c3a2
out: 
hex:

What it won't match is Unicode Code Point U+200B, "Zero-Width Space".
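Put differently, with /u the pattern \xE2\x80\x8B asks for three code points in a row (U+00E2, U+0080, U+008B), not for three bytes. The following sketch illustrates the difference; it assumes PHP 7 or later for the "\u{...}" string escape, and the variable names are made up for the example:

```php
<?php
// PHP 7+ "\u{...}" escapes produce the UTF-8 encoding of each code point.
$three_code_points = "\u{E2}\u{80}\u{8B}"; // U+00E2, U+0080, U+008B
$zero_width_space  = "\u{200B}";           // U+200B

echo bin2hex($three_code_points), "\n"; // c3a2c280c28b
echo bin2hex($zero_width_space), "\n";  // e2808b

// Single quotes: PHP passes the backslashes through, so PCRE interprets the \x escapes.
$pattern = '/\xE2\x80\x8B/u';

$matches_three = preg_match($pattern, $three_code_points);
$matches_zwsp  = preg_match($pattern, $zero_width_space);

var_dump($matches_three); // int(1)
var_dump($matches_zwsp);  // int(0)
```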

So, either treat your string as binary, don't use the /u modifier, and look for the expected string of bytes:

$input_zwsp_utf8 = "\xE2\x80\x8B";
$output = preg_replace('/\xE2\x80\x8B/', '', $input_zwsp_utf8);
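Since the needle in this byte-oriented approach is a fixed sequence of bytes, a regular expression is not strictly required. As an alternative to the preg_replace call (a swapped-in technique, not something from the question), str_replace performs the same byte-level replacement. The sample input below is invented for illustration:

```php
<?php
// "foo", a zero-width space, then "bar", all as UTF-8 bytes (sample data).
$input = "foo\xE2\x80\x8Bbar";

// Double quotes: PHP itself turns the \x escapes into the three raw bytes.
$output = str_replace("\xE2\x80\x8B", '', $input);

echo bin2hex($output), "\n"; // 666f6f626172
```

This is safe on valid UTF-8 input because the byte sequence E2 80 8B cannot occur as part of the encoding of any other character.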

Or, treat your string as UTF-8, and look for the code point you're interested in:

$input_zwsp_utf8 = "\xE2\x80\x8B";
$output = preg_replace('/\x{200B}/u', '', $input_zwsp_utf8);
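One thing to watch for with the /u approach is that preg_replace returns null when the subject is not valid UTF-8. A minimal sketch of a wrapper that checks for this follows; strip_zwsp is a hypothetical helper name, not an existing PHP function, and the PHP 7+ "\u{...}" escape is used only to build the sample input:

```php
<?php
/**
 * Hypothetical helper: remove every zero-width space (U+200B) from a UTF-8 string.
 * Throws if preg_replace() fails, for example because $text is not valid UTF-8.
 */
function strip_zwsp(string $text): string
{
    $result = preg_replace('/\x{200B}/u', '', $text);
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed with error code ' . preg_last_error()
        );
    }
    return $result;
}

// Sample input: "a", ZWSP, "b", ZWSP, "c".
echo bin2hex(strip_zwsp("a\u{200B}b\u{200B}c")), "\n"; // 616263
```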
