php, encoding, utf-8, multibyte

Same encoding (UTF-8), but different lengths of string and content (PHP)


I have two string variables. The first is set manually in the code ($date1 = "14 июня"); the second is parsed from a remote page using cURL and phpQuery. When printed, both look the same, but their lengths and contents differ:

echo $date1; //output: 14 июня
echo $date2; //output: 14 июня
echo $date1[2]; //output: a space (the third character of the string)
echo $date2[2]; //output: � (only part of the multibyte third character)
echo strlen($date1); //output: 7
echo strlen($date2); //output: 12
echo mb_detect_encoding($date1); //output: UTF-8
echo mb_detect_encoding($date2); //output: UTF-8

Is there a way to convert $date2 to the same format/encoding as $date1?

P.S. There is an SO topic about iconv(), but I couldn't find a working solution there.


Solution

  • So you have two strings:

    313420d0b8d18ed0bdd18f - this uses the 0x20 character as the space.

    3134c2a0d0b8d18ed0bdd18f - this uses the byte sequence 0xC2 0xA0 as the space (the UTF-8 encoding of U+00A0, the Unicode non-breaking space).

    Apart from those spaces, the strings are identical.

    To replace space-like Unicode characters with a regular space, you can use the following regular expression:

    preg_replace('~\p{Zs}~u', ' ', $str)
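
To make the diagnosis and the fix above concrete, here is a minimal sketch. It assumes PHP 7.0+ (for the "\u{...}" escape) and a source file saved as UTF-8. The sample values are reconstructed from the hex dumps above, and normalizeSpaces() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Sample data reconstructed from the hex dumps above (an assumption for illustration):
// $date1 contains a regular space (U+0020), $date2 a non-breaking space (U+00A0).
$date1 = "14 июня";
$date2 = "14\u{00A0}июня";

// bin2hex() shows the raw bytes, which is where the difference becomes visible.
echo bin2hex($date1), PHP_EOL; // 313420d0b8d18ed0bdd18f
echo bin2hex($date2), PHP_EOL; // 3134c2a0d0b8d18ed0bdd18f

// Hypothetical helper: replace every Unicode space separator with a plain space.
function normalizeSpaces(string $str): string
{
    // \p{Zs} matches any character in the "space separator" category;
    // the u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('~\p{Zs}~u', ' ', $str);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // in that case this sketch simply returns the input unchanged.
    return $result ?? $str;
}

var_dump(normalizeSpaces($date2) === $date1); // bool(true)
```

Note that strlen() counts bytes, so it reports 11 and 12 for these two sample strings, while mb_strlen() with UTF-8 counts characters and reports 7 for both.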
    
