macos file go unicode filesystems

How do you distinguish filenames when reading/writing unicode filenames?


I am writing Go code, but I don't believe this is unique to Go, so let's generalize it. Imagine a user, via Go code, creates three files with three distinct Unicode names. Notice that the last letters of each filename are different.

  • καθέδρα.txt
  • καθέδρᾳ.txt
  • καθέδραι.txt

In Go, these are three different, unique strings. However, if you try to create three files with these names, only two files end up on disk. The second and third filenames appear to be treated as the same file, so when the program writes three user-created files, one goes "missing".

If you write καθέδρᾳ.txt then καθέδραι.txt you end up with only the first filename.

If you write καθέδραι.txt then καθέδρᾳ.txt you end up with only the first filename.

How do you guard in Go against this macOS filename behavior with Unicode? The file system appears to treat two different strings as one filename.


Solution

  • When you choose a case-insensitive file system on macOS, the case-folding process is more complex than intuition suggests. The rules differ depending on the language.

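    • Uppercase of a is A (in English).
    • Uppercase of i is İ in some languages (Turkish, for example).
    • Apparently ᾳ equates to αι: Unicode's full case mapping uppercases ᾳ to ΑΙ, the same result as for αι.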

    There is no real way to guard against this except to detect the file system type.

    The cross-platform way to detect the problem is to have your software write a file and then try to read it back using a different "case". If the read succeeds, the file system folds case and distinct names may collide.