Search code examples
linuxmacosutf-8diacriticsdead-key

Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts


I have a quite big problem, where I can't find any help around in the web:

I moved a page from a website from OSX to Linux (both systems are running in de_DE.UTF-8) and run in an quite unknown problem: Some of the files were not found anymore, but obviously existed on the harddrive with (visibly) the same name. All those files contained german umlauts.

I took one sample image, copied the original request-uri from the webpage and called it directly - same error. After rewriting the file-name it worked. And yes, I did not mistype it!

This surprised me and I took a look into the apache-log where I found these entries:

192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"

That was something for me to investigate ... Here's what I found in the UTF8 chartable http://www.utf8-chartable.de/:

ö   c3 b6   LATIN SMALL LETTER O WITH DIAERESIS
¨   cc 88   COMBINING DIAERESIS

I think you've already heard of dead-keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article. It's quite interesting ;)

Does that mean, that OSX saves all diacritics separate to the letter? Does that really mean, that OSX saves the character ö as o and ¨ instead of using the real character that results of the combination?

If yes, do you know of a good script that I could use to rename these files? This won't be the first page I move from OSX to Linux ...


Solution

  • Thanks, Jon Hanna for much background-information here! This was important to get the full answer: a way to convert from the one to the other normalisation form.

    As my changes are in the filesystem (because of file-upload) that is linked in the database, I now have to update my database-dump. The files got already renamed during the move (maybe by the FTP-Client ...)

    Command line tools to convert charsets on Linux are:

    • iconv - converting the content of a stream (maybe a file)
    • convmv - converting the filenames in a directory

    The charset utf-8-mac (as described in http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), I could use in iconv, seems to exist just on OSX systems and so I have to move my sql-dump to my mac, convert it and move it back. Another option would be to rename the files back using convmv to NFD, but this would more hinder than help in the future, I think.

    The tool convmv has a build-in (os-independent) option to enforcing NFC- or NFD-compatible filenames: http://www.j3e.de/linux/convmv/man/

    PHP itself (the language my system - Wordpress is based on) supports a compatibility-layer here: In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere? After I fixed this issue for me, I will go and write some tests and may also write a bug-report to Wordpress and other systems I work with ;)