I'm having some trouble matching/replacing the Unicode ZWSP (zero-width space) character encoded as UTF-8.
ZWSP: \x20\x0B
ZWSP (UTF8): \xE2\x80\x8B
As an extra test case I have used NBSP (Non-breaking space) which works as expected
All preg_replace calls are in UTF8 mode (/u).
When matching NBSP it works as expected. The input is encoded as UTF8 and the output is empty (NBSP unicode replaced with an empty string)
When matching ZWSP it only works if the ZWSP input is not UTF8 encoded.
If you change the ZWSP pattern to the UTF8 encoded version and keep input as UTF8 it doesn't work either
... or is this a bug?
code
$nbsp = '\xA0'; // Non-breaking space
$zwsp = '\x20\x0B'; // Zero-width space
$zwsp_utf8 = '\xE2\x80\x8B';
$input_nbsp_utf8 = "\xC2\xA0";
$input_zwsp = "\x20\x0B";
$input_zwsp_utf8 = "\xE2\x80\x8B";
// NBSP
echo "NBSP\n-----\n";
echo "in: $input_nbsp_utf8--\nhex: ".bin2hex($input_nbsp_utf8)."\n";
$output = preg_replace('/'.$nbsp.'/u', '', $input_nbsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (input: **not** UTF8)
echo "ZWSP (input: **not** UTF8)\n-----\n";
echo "in: $input_zwsp--\nhex: ".bin2hex($input_zwsp)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (input: UTF8)
echo "ZWSP (input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (pattern: UTF8, input: UTF8)
echo "ZWSP (pattern: UTF8, input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp_utf8.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
Output
NBSP
-----
in: --
hex: c2a0
out: --
hex:
ZWSP (input: **not** UTF8)
-----
in:
--
hex: 200b
out: --
hex:
ZWSP (input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty
ZWSP (pattern: UTF8, input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty
Like many people, you seem to be confused about what UTF-8 is. UTF-8 isn't a setting which is on or off, it is one of many different ways of turning text into binary data, and interpreting that binary data to get back the text.
I'm not sure where \x20\x0B came from, or what it has to do with anything, but saying something is "not UTF-8" is like saying a word is "not French", or a piece of meat is "not chicken".
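As a guess (the question doesn't confirm it), \x20\x0B may simply be the code point number 200B split into two bytes. The sketch below, which assumes PHP 7.0+ for the "\u{...}" string escape and uses illustrative variable names, shows that those two bytes are a space followed by a vertical tab, and that they only mean "zero-width space" if the data is read as big-endian UTF-16 rather than UTF-8:

```php
<?php
// The two bytes from the question: 0x20 is a space, 0x0B is a vertical tab.
$asBytes = "\x20\x0B";
var_dump($asBytes === " \v");    // bool(true)

// The same character, U+200B, turned into bytes in two different encodings.
$utf8    = "\u{200B}";           // UTF-8 (the "\u{...}" escape needs PHP 7.0+)
$utf16be = pack('n', 0x200B);    // 16-bit big-endian, i.e. UTF-16BE for this code point

echo bin2hex($utf8), "\n";       // e2808b
echo bin2hex($utf16be), "\n";    // 200b

// The "not UTF-8" input in the question has the same bytes as the UTF-16BE form.
var_dump($utf16be === $asBytes); // bool(true)
```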
Ignoring that part, let's look at the key piece of code:
$input_zwsp_utf8 = "\xE2\x80\x8B";
$output = preg_replace('/\xE2\x80\x8B/u', '', $input_zwsp_utf8);
You have provided the /u modifier, about which the manual says:
Pattern and subject strings are treated as UTF-8.
Then you've matched using the \xhh notation, which is described under escape sequences:
After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.
This is a bit confusing, but it's saying that normally, \xE2 would match the binary byte E2, i.e. 11100010; but with /u active, it will instead match the Unicode code point U+00E2, which is "Latin Small Letter a With Circumflex" (encoded in UTF-8 as the two bytes C3 A2).
Example:
$input = 'â';
echo "in: $input\nhex: ".bin2hex($input)."\n";
$output = preg_replace('/\xE2/u', '', $input);
echo "out: $output\nhex: ".bin2hex($output)."\n\n";
Output:
in: â
hex: c3a2
out:
hex:
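What it won't match is Unicode code point U+200B, "Zero-Width Space". (Your NBSP test only worked because \xA0 was read as code point U+00A0, which happens to be exactly the character you wanted.)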
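To make that concrete, here is a small sketch of what the three-escape pattern does match under /u. It assumes PHP 7.0+ for the "\u{...}" string escape, and the variable names are only illustrative:

```php
<?php
// Single-quoted, so PHP passes the backslashes through and PCRE reads the \xhh escapes.
$pattern = '/\xE2\x80\x8B/u';

// Three code points (U+00E2, U+0080, U+008B), each stored as two UTF-8 bytes.
$threeCodePoints = "\u{E2}\u{80}\u{8B}";
// One code point (U+200B), stored as three UTF-8 bytes.
$zwsp = "\u{200B}";

echo bin2hex($threeCodePoints), "\n";             // c3a2c280c28b
echo bin2hex($zwsp), "\n";                        // e2808b

var_dump(preg_match($pattern, $threeCodePoints)); // int(1)
var_dump(preg_match($pattern, $zwsp));            // int(0)

// The same rule applied to NBSP: \xA0 is read as U+00A0, whose UTF-8 bytes are c2a0.
$nbspRemoved = preg_replace('/\xA0/u', '', "\u{A0}");
var_dump($nbspRemoved);                           // string(0) ""
```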
So, either treat your string as binary, don't use the /u modifier, and look for the expected string of bytes:
$input_zwsp_utf8 = "\xE2\x80\x8B";
$output = preg_replace('/\xE2\x80\x8B/', '', $input_zwsp_utf8);
Or, treat your string as UTF-8, and look for the code point you're interested in:
$input_zwsp_utf8 = "\xE2\x80\x8B";
$output = preg_replace('/\x{200B}/u', '', $input_zwsp_utf8);
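As a quick check, the following sketch runs both approaches on a string with text around the zero-width space. The sample string and variable names are made up for illustration, and the "\u{...}" escape assumes PHP 7.0+:

```php
<?php
// "foo" and "bar" separated by an invisible zero-width space (U+200B).
$text = "foo\u{200B}bar";
echo bin2hex($text), "\n"; // 666f6fe2808b626172

// Byte-oriented: no /u, the pattern lists the three UTF-8 bytes.
$viaBytes = preg_replace('/\xE2\x80\x8B/', '', $text);

// Code-point-oriented: /u, the pattern names the code point itself.
$viaCodePoint = preg_replace('/\x{200B}/u', '', $text);

var_dump($viaBytes === 'foobar');     // bool(true)
var_dump($viaCodePoint === 'foobar'); // bool(true)
```

Both give the same result here; the /u version is probably the easier one to read and extend, since the pattern says which character is meant rather than how it happens to be stored.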