Tags: php, utf-8, character-encoding, preg-replace, windows-server-2008

Replacing empty space with preg_replace causes invalid characters with UTF-8


Our PHP web application (PHP 5.6.30 running on Windows Server 2008 R2) uses UTF-8 encoding but needs to import data from files that are encoded using Windows-1252. When the data is imported it is converted to UTF-8 as follows.

$value = iconv('Windows-1252', 'UTF-8', $value);
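To see what this conversion produces at the byte level, a minimal sketch follows. The helper name `toUtf8` and the sample bytes are illustrative assumptions, not part of the application; `bin2hex` is used so the output does not depend on how a terminal or browser renders the characters.

```php
<?php
// Hypothetical helper wrapping the conversion shown in the question.
function toUtf8($value)
{
    return iconv('Windows-1252', 'UTF-8', $value);
}

// 0xE0 is "à" and 0x80 is "€" in Windows-1252.
foreach (array("\xE0", "\x80") as $raw) {
    echo bin2hex($raw), ' -> ', bin2hex(toUtf8($raw)), PHP_EOL;
}
// Prints:
// e0 -> c3a0
// 80 -> e282ac
```

Note that the UTF-8 form of à ends in the byte 0xA0, which is a non-breaking space when read as a single Windows-1252 byte.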

When we import the following sample data, the conversion works correctly for most of the Windows-1252 characters, but the à character on line 8 below is not converted correctly.

1;€
2;é
3;è
4;ë
5;ï
6;ä
7;á
8;à
9;ç
10;ß
11;ø 
12;í
13;ì
14;ñ
15;@
16;û

Here is a screenshot showing the result of displaying this data on the website.

[Screenshot: the imported data as displayed on the website]

Does anyone know why PHP's iconv is not converting the à character correctly?


Solution

  • I resolved this issue, and it turned out to have nothing to do with iconv, contrary to what I initially thought. The required change was tiny, only one character, but it took me ages to track down. The offending statement was actually the following:

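    preg_replace('/\s+/', ' ', $columnvalue)

    The purpose of this regular expression is to collapse runs of whitespace into a single space, but because the value was UTF-8 encoded, it had the side effect of corrupting the à character. In UTF-8, à is stored as the two bytes 0xC3 0xA0. Without the u modifier, PCRE matches byte by byte, and on this setup \s apparently also matched the byte 0xA0 (a non-breaking space in Windows-1252), so the second byte of à was replaced and an invalid sequence was left behind. I resolved this by adding the u (Unicode) modifier to the end of the regular expression, so that the subject is treated as UTF-8. The expression became:

    preg_replace('/\s+/u', ' ', $columnvalue)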
    

    With that change, the page displayed the à character correctly.
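The effect can be reproduced in a small, hedged sketch. Whether a plain `\s` matches the byte 0xA0 in byte mode depends on the platform and locale, so the "broken" pattern below adds `\xA0` to the character class explicitly to imitate that behaviour; this is an assumption made for the demonstration, not the original code. The helper `isValidUtf8` and the sample value are likewise hypothetical.

```php
<?php
// Returns true when $s is well-formed UTF-8: with the u modifier,
// preg_match() returns false on invalid input instead of matching.
function isValidUtf8($s)
{
    return preg_match('//u', $s) === 1;
}

// "à" (bytes C3 A0) followed by two spaces and "b".
$value = "\xC3\xA0  b";

// Byte mode. 0xA0 is listed explicitly to imitate a locale in which
// \s matches that byte (assumption for this demo).
$broken = preg_replace('/[\s\xA0]+/', ' ', $value);

// UTF-8 mode: the pattern works on whole code points, so "à" survives.
$fixed = preg_replace('/\s+/u', ' ', $value);

echo bin2hex($broken), ' valid: ', var_export(isValidUtf8($broken), true), PHP_EOL;
echo bin2hex($fixed), ' valid: ', var_export(isValidUtf8($fixed), true), PHP_EOL;
// Prints:
// c32062 valid: false
// c3a02062 valid: true
```

In the first result the trailing byte of à has been swallowed by the whitespace match, leaving a lone 0xC3, which is why the character shows up as invalid on the page. A check along the lines of `isValidUtf8` right after each transformation step can help locate which statement introduces the corruption.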