php unicode encoding preg-replace scandir

Encoding problem with preg_replace() and scandir()


On OS X (PHP 5.2.11) I have a file, siësta.doc (and thousands of others with Unicode filenames), and I want to convert the file names to a web-safe format (a-zA-Z0-9.). If I hardcode the file name above, I get the right conversion:

<?php
  $file = 'siësta.doc';
  echo preg_replace("/[^a-zA-Z0-9.]/u", '_', $file);
  // Output: si_sta.doc
?>

But if I read the file names with scandir(), I get strange conversions:

<?php
  $files = scandir(DIRNAME);
  foreach ($files as $file) {
    echo preg_replace("/[^a-zA-Z0-9.]/u", '_', $file);
    // Output for the file above: sie_sta.doc
  }
?>

I tried to detect the encoding, set the encoding, and convert it with the iconv functions. I also tried the mb_ functions, but that only made things worse. What am I doing wrong?

Thanks in advance


Solution

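  • Interesting. After a bit of research I found that OS X stores filenames as "decomposed Unicode" (see http://developer.apple.com/mac/library/qa/qa2001/qa1173.html). That is, "ë" is represented as a plain "e" followed by a combining diaeresis (U+0308, encoded in UTF-8 as the bytes 0xCC 0x88). Your regex keeps the "e" and replaces only the combining mark, which is why you get sie_sta.doc instead of si_sta.doc.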
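A minimal sketch of one way to work around this, assuming the names returned by scandir() are valid UTF-8: strip the combining marks (Unicode category Mn) before running the original replacement. The helper name web_safe_name() is hypothetical and only for illustration. Note that a decomposed "ë" then becomes a plain "e", while a precomposed "ë" still becomes "_". If the intl extension is available (bundled since PHP 5.3, a PECL package for 5.2), Normalizer::normalize($name, Normalizer::FORM_C) would probably be the cleaner option, since it recomposes the character instead of dropping the mark.

```php
<?php
// Hypothetical helper, not from the original post.
// Assumes $name is valid UTF-8; with the /u modifier, preg_replace()
// returns NULL on invalid input.
function web_safe_name($name)
{
    // Drop combining marks such as U+0308 (combining diaeresis),
    // which OS X stores as a separate code point after the base letter.
    $name = preg_replace('/\p{Mn}+/u', '', $name);

    // Same replacement as in the question: anything outside a-zA-Z0-9. becomes "_".
    return preg_replace('/[^a-zA-Z0-9.]/u', '_', $name);
}

$decomposed  = "sie\xCC\x88sta.doc"; // "e" + U+0308, the form scandir() returns on OS X
$precomposed = "si\xC3\xABsta.doc";  // single code point U+00EB, as typed in a UTF-8 source file

// The hex dumps make the otherwise invisible difference visible.
echo bin2hex($decomposed), "\n";        // 736965cc887374612e646f63
echo bin2hex($precomposed), "\n";       // 7369c3ab7374612e646f63

echo web_safe_name($decomposed), "\n";  // siesta.doc
echo web_safe_name($precomposed), "\n"; // si_sta.doc
?>
```

In the loop from the question, this would replace the direct preg_replace() call: echo web_safe_name($file);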