When my program calls stat() on a file whose name contains certain
UTF-8 characters, stat() returns an error. For example, I can open the
file /tmp/surgateDlpMgQure/Özkul Gazete with vi, but passing the same
path to stat() fails. The system locale settings are:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE=C
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
Should I do something in order for stat() to understand the UTF-8 characters?
Here is the code:
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat s;

    /* Stop on failure: s is not filled in when stat() fails. */
    if (stat("/tmp/surgateDlpMgQure/Özkul Gazete", &s) == -1) {
        perror("stat");
        return 1;
    }

    switch (s.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");     break;
    case S_IFCHR:  printf("character device\n"); break;
    case S_IFDIR:  printf("directory\n");        break;
    case S_IFIFO:  printf("FIFO/pipe\n");        break;
    /* stat() follows symlinks, so this case only fires with lstat(). */
    case S_IFLNK:  printf("symlink\n");          break;
    case S_IFREG:  printf("regular file\n");     break;
    case S_IFSOCK: printf("socket\n");           break;
    default:       printf("unknown?\n");         break;
    }
    return 0;
}
The problem is probably that the encoding of the file name on disk isn't
the same as the encoding of the string your program passes to stat().
The key questions here are who created the file (and gave it this name),
and where the string in your code comes from. Most of Unix is agnostic
with regard to the encoding, as long as a few special characters, like
'/', have the expected encoding. Thus, independently of your current
locale, a file name can be in Latin-1, Latin-5 (just guessing, but the
name looks Turkish) or UTF-8. Practically nothing in Unix cares, but you
have to ensure that your program uses the same encoding as was used to
create the file, or the names won't match. (In practice, I've found the
simplest policy to be to limit the characters in a filename to a very
small set: the ASCII letters, digits, '_' and possibly '-'.)
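If you go for the restricted character set, the check is easy to automate. This is only a minimal sketch: the function name is_restricted_name is made up for illustration, and the allowed set is exactly the one listed above (note that it excludes '.', which you may want to add if you use extensions).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if name is non-empty and consists only of
   ASCII letters, digits, '_' and '-'. */
bool is_restricted_name(const char *name)
{
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789_-";
    size_t len = strlen(name);

    /* strspn() returns the length of the leading run of allowed bytes;
       the name is acceptable only if that run covers the whole string. */
    return len > 0 && strspn(name, allowed) == len;
}
```

Any byte outside the set, including the space and every non-ASCII byte in "Özkul Gazete", makes the function return false, whatever encoding the name is in.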
If you're not sure about the actual encoding of the filename on disk,
you can use ls | od -t x1 -tc to find out the actual value of the bytes
in it. If your Ö is 0xD6, then the encoding is either Latin-1 or Latin-5
(and it probably won't make much difference which), and you'll have to
ensure that the filename you pass to stat (or open, or any other
function which takes a filename) is encoded the same way. If instead
you have the two byte sequence 0xC3, 0x96, then the filename is UTF-8.
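The same inspection can be done from inside the program. The following is a rough sketch, not a full validator: the names dump_name_bytes and guess_name_encoding are invented for this example, and the UTF-8 check only looks at lead and continuation byte patterns (it does not reject overlong forms or surrogates).

```c
#include <stdio.h>

enum name_encoding { NAME_ASCII, NAME_UTF8, NAME_OTHER };

/* Hypothetical helper: print every byte of name in hex, much like
   od -t x1 would. */
void dump_name_bytes(const char *name)
{
    for (const unsigned char *p = (const unsigned char *)name; *p; ++p)
        printf("%02x ", *p);
    printf("\n");
}

/* Hypothetical helper: NAME_ASCII if all bytes are below 0x80, NAME_UTF8
   if the non-ASCII bytes form UTF-8 shaped sequences, NAME_OTHER
   otherwise (for example a single-byte encoding such as Latin-1). */
enum name_encoding guess_name_encoding(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;
    enum name_encoding result = NAME_ASCII;

    while (*p != 0) {
        int extra;  /* number of continuation bytes expected */

        if (*p < 0x80) { ++p; continue; }
        else if (*p >= 0xC2 && *p <= 0xDF) extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF) extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4) extra = 3;
        else return NAME_OTHER;  /* stray continuation or invalid lead */

        ++p;
        while (extra-- > 0) {
            /* Continuation bytes look like 10xxxxxx; the terminating
               NUL fails this test, so we never run past the end. */
            if ((*p & 0xC0) != 0x80)
                return NAME_OTHER;
            ++p;
        }
        result = NAME_UTF8;
    }
    return result;
}
```

Called on each d_name returned by readdir, this would report NAME_OTHER for a name starting with the single byte 0xD6 and NAME_UTF8 for one starting with 0xC3, 0x96.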
If you do want to support characters outside of the ASCII subset, then
I'd strongly recommend that you ensure that all filenames are encoded
in UTF-8. That supposes that you can: the encoding will be decided by
the program creating the file, and if it's not your program (or if
you're receiving the file from another system), you may not be able to
do anything about it. In the worst case scenario, you may even have to
use opendir and readdir with some sort of matching algorithm to find
the actual filename (in whatever encoding it has), and use it.
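One possible shape for such a matching algorithm is sketched below. It assumes the name you are looking for is in UTF-8 and that the only other candidate encoding on disk is Latin-1; Latin-5 differs from Latin-1 in a handful of code points, which this sketch does not handle. The function names latin1_to_utf8 and find_actual_name are hypothetical.

```c
#include <dirent.h>
#include <string.h>

/* Convert a Latin-1 string to UTF-8.
   Returns 0 on success, -1 if dst is too small. */
int latin1_to_utf8(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    for (const unsigned char *p = (const unsigned char *)src; *p; ++p) {
        if (*p < 0x80) {
            if (n + 1 >= dstsize) return -1;
            dst[n++] = (char)*p;
        } else {
            /* 0x80..0xFF becomes a two byte sequence 110000xx 10xxxxxx */
            if (n + 2 >= dstsize) return -1;
            dst[n++] = (char)(0xC0 | (*p >> 6));
            dst[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    if (n >= dstsize) return -1;
    dst[n] = '\0';
    return 0;
}

/* Search dir for an entry equal to wanted_utf8, either byte for byte or
   after Latin-1 to UTF-8 conversion. On success, copy the name exactly
   as stored on disk into found and return 0; otherwise return -1. */
int find_actual_name(const char *dir, const char *wanted_utf8,
                     char *found, size_t foundsize)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int result = -1;
    struct dirent *e;
    while (result != 0 && (e = readdir(d)) != NULL) {
        char converted[1024];
        int same = strcmp(e->d_name, wanted_utf8) == 0;
        int same_converted =
            latin1_to_utf8(e->d_name, converted, sizeof converted) == 0
            && strcmp(converted, wanted_utf8) == 0;

        if ((same || same_converted) && strlen(e->d_name) < foundsize) {
            strcpy(found, e->d_name);
            result = 0;
        }
    }
    closedir(d);
    return result;
}
```

The name copied into found is the raw byte sequence from the directory, so that is what you would append to the directory path before calling stat or open.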