
Umlauts in OS X file names (perl)


I'm having some trouble with umlauts (the ü character) in file names on OS X. I'm creating the directory from a Perl script. Conceptually, what I'm doing is:

$NAME = "abcüabc";
$PATH = "/Applications/MyProgram/".$NAME."/";
system('ditto', '--rsrc', $FROMPATH, $PATH . $FILENAME);

This creates the folder with the name "/Applications/MyProgram/abc%9Fabc/".

Anyone know how I can fix this to create the directory with the correct characters?


Solution

  • You have to say:

    use utf8;
    

    in your Perl source if you expect the string literals in it to be interpreted as UTF-8 characters instead of raw bytes.

    % uname -a
    Darwin arwen 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
    
    % cat /tmp/makeit 
    use utf8;
    
    $name = "abcüabc";
    $path = "/tmp/$name";
    
    mkdir($path,0777) || die "can't mkdir $path: $!";
    
    % perl /tmp/makeit
    
    % ls -dF /tmp/abc*
    /tmp/abcüabc/
    

    See? It works just fine if you do that.
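A minimal sketch of what that pragma changes, under some assumptions: the literal is written with byte escapes so the example does not depend on how the file is saved, `decode` stands in for what `use utf8` does to a literal, and `make_dir_utf8` is a hypothetical helper (not part of the original script) that encodes the name explicitly before calling `mkdir`.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# The UTF-8 octets for "abcüabc", written with escapes so the sketch does
# not depend on the encoding this file is saved in.
my $octets = "abc\xC3\xBCabc";

# Without `use utf8`, Perl sees such a literal as 8 separate bytes.
# With `use utf8`, the literal is decoded to characters; decode() emulates that.
my $chars = decode('UTF-8', $octets);

# Hypothetical helper: encode the character string to UTF-8 explicitly
# before handing it to mkdir, rather than relying on Perl's internal format.
sub make_dir_utf8 {
    my ($parent, $name) = @_;
    my $path = "$parent/" . encode('UTF-8', $name);
    mkdir($path, 0777) or die "can't mkdir $path: $!";
    return $path;
}

# Use a throwaway directory so the demo leaves nothing behind.
my $made = make_dir_utf8(tempdir(CLEANUP => 1), $chars);

printf "%d octets, %d characters\n", length($octets), length($chars);
# prints "8 octets, 7 characters"
```

Encoding explicitly is a design choice here, not something the transcript above requires; it just makes the bytes that reach the filesystem predictable.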


    EDIT: You’re using MacRoman!

    % macroman 0x9F
    MacRoman 0x9F  ⇒  U+00FC  ‹ü›  \N{LATIN SMALL LETTER U WITH DIAERESIS}
    

    And the filesystem will not store U+00FC as such anyway, because it keeps file names in decomposed form: U+00FC becomes a "u" followed by "\N{COMBINING DIAERESIS}". Did you actually enter MacRoman characters in your Perl source code? However did you do THAT? Please convert to Unicode! Perl has no idea that your source code is in legacy MacRoman, so it takes the byte 0x9F to be U+009F, which is a control code meaning "\N{APPLICATION PROGRAM COMMAND}".
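The three readings of that one byte can be checked from Perl itself. This is a small sketch using the core Encode and Unicode::Normalize modules; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

my $byte = "\x9F";    # the byte behind "%9F" in the bad folder name

# Read as MacRoman, the byte is U+00FC (u with diaeresis).
my $as_macroman = decode('MacRoman', $byte);

# Read as Latin-1, which is how Perl treats undeclared high bytes,
# the same byte is U+009F, a control code.
my $as_latin1 = decode('ISO-8859-1', $byte);

# Canonical decomposition splits U+00FC into "u" plus U+0308.
my $decomposed = NFD($as_macroman);

printf "MacRoman: U+%04X\n", ord $as_macroman;
printf "Latin-1:  U+%04X\n", ord $as_latin1;
print  "NFD:      ",
    join(' ', map { sprintf 'U+%04X', ord } split //, $decomposed), "\n";
# prints:
# MacRoman: U+00FC
# Latin-1:  U+009F
# NFD:      U+0075 U+0308
```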

    Here, watch:

    % cat /tmp/makeit
    use utf8;
    
    $name = "abcüabc";
    $path = "/tmp/$name";
    
    mkdir($path,0777) || die "can't mkdir $path: $!";
    
    % uniquote /tmp/makeit
    use utf8;
    
    $name = "abc\N{U+FC}abc";
    $path = "/tmp/$name";
    
    mkdir($path,0777) || die "can't mkdir $path: $!";
    
    % uniquote -v /tmp/makeit
    use utf8;
    
    $name = "abc\N{LATIN SMALL LETTER U WITH DIAERESIS}abc";
    $path = "/tmp/$name";
    
    mkdir($path,0777) || die "can't mkdir $path: $!";
    
    % uniquote -b /tmp/makeit
    use utf8;
    
    $name = "abc\xC3\xBCabc";
    $path = "/tmp/$name";
    
    mkdir($path,0777) || die "can't mkdir $path: $!";
    
    % perl /tmp/makeit
    
    % ls -Fd /tmp/abc* | uniquote -v
    /tmp/abcu\N{COMBINING DIAERESIS}abc/
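Because the name comes back decomposed, a plain string comparison between what the script wrote and what the directory listing returns will fail. One way around that, sketched here on the assumption that both names have already been decoded from UTF-8 into character strings, is to compare normalized forms; `same_filename` is a hypothetical helper.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: treat two names as equal when their NFC forms match.
sub same_filename {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

my $typed   = "abc\x{FC}abc";      # precomposed, as written in the script
my $on_disk = "abcu\x{308}abc";    # decomposed, as ls reported it

print "plain eq:   ", ($typed eq $on_disk ? "same" : "different"), "\n";
print "normalized: ", (same_filename($typed, $on_disk) ? "same" : "different"), "\n";
# prints:
# plain eq:   different
# normalized: same
```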
    

    The uniquote program shows you what is really in a file, and the macroman script used above looks up MacRoman bytes.

    You seem to have somehow entered ugly old MacRoman in your Perl code. Please convert it to Unicode:

    % iconv -f MacRoman -t UTF-8 < input > output
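If iconv is not at hand, the same conversion can be sketched in Perl with I/O layers. This is an illustration rather than a replacement for the command above: `macroman_to_utf8_file` is a hypothetical helper, and the demo writes its own small MacRoman input into a temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper mirroring `iconv -f MacRoman -t UTF-8 < input > output`.
sub macroman_to_utf8_file {
    my ($in_path, $out_path) = @_;
    # The layers decode MacRoman bytes on read and encode UTF-8 on write.
    open(my $in,  '<:encoding(MacRoman)', $in_path)
        or die "can't read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "can't write $out_path: $!";
    print {$out} $_ while <$in>;
    close($out) or die "can't close $out_path: $!";
    close($in);
    return $out_path;
}

# Demo: write one line containing the MacRoman byte 0x9F, then convert it.
my $dir = tempdir(CLEANUP => 1);
my ($input, $output) = ("$dir/input", "$dir/output");
open(my $fh, '>:raw', $input) or die "can't write $input: $!";
print {$fh} "\$name = \"abc\x9Fabc\";\n";
close($fh) or die "can't close $input: $!";

macroman_to_utf8_file($input, $output);
```

After the conversion the output file holds the bytes 0xC3 0xBC where the input had 0x9F, so `use utf8` can interpret the literal correctly.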