stat function : no such file or directory error

When my program tries to stat() a file containing specific UTF-8 characters, the stat() function returns an error. For example, I can open the file /tmp/surgateDlpMgQure/Özkul Gazete with vi, but passing this same file to stat() generates an error. System locale settings are:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE=C
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Should I do something in order for stat() to understand the UTF-8 characters?

Here is the code:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat s;

    /* Stop on failure: s is not filled in when stat() fails. */
    if (stat("/tmp/surgateDlpMgQure/Özkul Gazete", &s) == -1) {
        perror("stat");
        return 1;
    }

    switch (s.st_mode & S_IFMT) {
            case S_IFBLK:  printf("block device\n");            break;
            case S_IFCHR:  printf("character device\n");        break;
            case S_IFDIR:  printf("directory\n");               break;
            case S_IFIFO:  printf("FIFO/pipe\n");               break;
            /* stat() follows symlinks; only lstat() can report S_IFLNK. */
            case S_IFLNK:  printf("symlink\n");                 break;
            case S_IFREG:  printf("regular file\n");            break;
            case S_IFSOCK: printf("socket\n");                  break;
            default:       printf("unknown?\n");                break;
    }

    return 0;
}

Solution

The problem is probably that the encoding of the file name on disk isn't the same as the encoding of the string in your program. The key questions here are who created the file (and gave it this name), and where the string in your code comes from. Most of Unix is agnostic with regard to encoding, as long as a few special characters, like '/', have the expected encoding: the kernel treats a file name as a sequence of bytes and compares it byte for byte. Thus, independently of your current locale, a file name can be in Latin-1, Latin-5 (just guessing, but the name looks Turkish) or UTF-8. Practically nothing in Unix cares, but you have to ensure that your program uses the same encoding as was used to create the file, or the names won't match. (In practice, I've found the simplest policy to be to limit the characters in a filename to a very small set: the ASCII letters, digits, '_' and possibly '-'.)

If you're not sure about the actual encoding of the filename on disk, you can use ls | od -t x1 -tc in the directory to see the actual values of the bytes in it. If your Ö is the single byte 0xD6, then the encoding is either Latin-1 or Latin-5 (and it probably won't make much difference which), and you'll have to ensure that the filename you pass to stat (or open, or any other function which takes a filename) is encoded in one of these encodings. If instead you have the two-byte sequence 0xC3, 0x96, then the filename is UTF-8.
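The same check can be made from inside the program, on the string you are about to pass to stat. What follows is only a minimal sketch: the names hex_bytes and looks_like_utf8 are made up for illustration, and the UTF-8 test is deliberately simplified (it checks lead and continuation bytes only, and does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of name as space-separated
   two-digit hex values into out. Stops early if out is too small. */
void hex_bytes(const char *name, char *out, size_t outsz)
{
    size_t used = 0;

    assert(name != NULL && out != NULL);
    if (outsz == 0)
        return;
    out[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        int n = snprintf(out + used, outsz - used,
                         used ? " %02x" : "%02x", *p);
        if (n < 0 || (size_t)n >= outsz - used)
            break;                      /* buffer full: output truncated */
        used += (size_t)n;
    }
}

/* Hypothetical helper: returns 1 if name is structurally valid UTF-8
   (each lead byte followed by the right number of continuation bytes),
   0 otherwise. Simplified: overlong forms are not rejected. */
int looks_like_utf8(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    assert(name != NULL);
    while (*p) {
        int follow;

        if (*p < 0x80)                follow = 0;   /* plain ASCII */
        else if ((*p & 0xE0) == 0xC0) follow = 1;   /* 2-byte sequence */
        else if ((*p & 0xF0) == 0xE0) follow = 2;   /* 3-byte sequence */
        else if ((*p & 0xF8) == 0xF0) follow = 3;   /* 4-byte sequence */
        else return 0;                              /* stray byte */
        p++;
        while (follow-- > 0) {
            /* A terminating NUL also fails this test, so we never
               read past the end of the string. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            p++;
        }
    }
    return 1;
}
```

A name that starts with the bytes c3 96 passes looks_like_utf8, while the Latin-1 or Latin-5 form starting with d6 fails it, because 0xD6 announces a two-byte sequence and the following 'z' is not a continuation byte. Dumping both the string in your program and the name returned by readdir with hex_bytes shows directly whether they differ.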

If you do want to support characters outside of the ASCII subset, then I'd strongly recommend that you ensure that all filenames are encoded in UTF-8, supposing that you can: the encoding is decided by the program creating the file, and if it's not your program (or if you're receiving the file from another system), you may not be able to do anything about it. In the worst case, you may even have to use opendir and readdir with some sort of matching algorithm to find the actual filename (in whatever encoding it has), and use it.
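One possible shape for such a matching algorithm is sketched below. It is an assumption on my part, not the only way to do it: two names are considered the same if their ASCII bytes agree and every run of non-ASCII bytes in one lines up with a run of non-ASCII bytes in the other. The function names same_ascii_skeleton and find_on_disk_name are invented for this example.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher: returns 1 if a and b are equal once every run
   of non-ASCII bytes (>= 0x80) is treated as a single wildcard. */
int same_ascii_skeleton(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    assert(a != NULL && b != NULL);
    for (;;) {
        int p_high = *p >= 0x80;
        int q_high = *q >= 0x80;

        if (p_high != q_high)
            return 0;               /* wildcard on one side only */
        if (p_high) {
            while (*p >= 0x80) p++; /* skip the whole run on both sides */
            while (*q >= 0x80) q++;
            continue;
        }
        if (*p != *q)
            return 0;               /* ASCII bytes must agree exactly */
        if (*p == '\0')
            return 1;               /* both strings ended together */
        p++;
        q++;
    }
}

/* Scans dir for the first entry matching wanted. On a match, copies the
   name exactly as stored on disk into out and returns 1. Returns 0 if
   nothing matched (or out is too small), -1 if dir cannot be opened. */
int find_on_disk_name(const char *dir, const char *wanted,
                      char *out, size_t outsz)
{
    DIR *d;
    struct dirent *e;
    int found = 0;

    assert(dir != NULL && wanted != NULL && out != NULL);
    d = opendir(dir);
    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        if (same_ascii_skeleton(e->d_name, wanted)
            && strlen(e->d_name) < outsz) {
            strcpy(out, e->d_name);
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}
```

With this, the UTF-8 and the Latin-1 spellings of "Özkul Gazete" match each other, and the name copied into out can be appended to the directory path and passed to stat unchanged. The match is loose by design, so two files that differ only in their non-ASCII characters would be indistinguishable; if that can happen in your directory, you need a stricter comparison.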