c++ file wchar-t wstring kernel32

GetFileAttributesW fails for non-ASCII characters


So I am trying to check whether a given file exists. Following this answer I tried GetFileAttributesW. It works just fine for any ASCII input, but it fails for ß, ü and á (and, I suspect, any other non-ASCII character). I get ERROR_FILE_NOT_FOUND for filenames containing them and ERROR_PATH_NOT_FOUND for pathnames containing them, as one would expect if they didn't exist.

I made 100% sure that they did. I spent 15 minutes copying filenames to avoid typos and using literals to rule out bad input. I couldn't find any mistake.

Since all of these characters are non-ASCII, I stopped trying, because I suspect I have made a mistake with encodings. I just can't spot it. Is there something I am missing? I link against Kernel32.lib.

Thanks!

#include <stdio.h>
#include <iostream>
#include <string>
#include "Windows.h"


int main(){
    while(true){
        std::wstring file_path;
        std::getline(std::wcin, file_path);

        DWORD dwAttrib = GetFileAttributesW(file_path.data());
        if(dwAttrib == INVALID_FILE_ATTRIBUTES){
            printf("error: %lu\n", GetLastError());
            continue;
        }

        if(!(dwAttrib & FILE_ATTRIBUTE_DIRECTORY))
            printf("valid!\n");
        else
            printf("invalid!\n");
    }
}

Solution

  • It's extremely hard to make Unicode work well in a console program on Windows, so let's start by removing that aspect of it (for now).

    Modify your program so that it looks like this:

    #include <cstdio>
    #include <iostream>
    #include <string>
    #include "Windows.h"
    
    int main() {
        std::wstring file_path = L"fooß.txt";
    
        DWORD dwAttrib = GetFileAttributesW(file_path.data());
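        if (dwAttrib == INVALID_FILE_ATTRIBUTES) {
            // Stop here: INVALID_FILE_ATTRIBUTES has every bit set, so the
            // directory test below would otherwise also print "invalid!".
            printf("error: %lu\n", GetLastError());
            return 1;
        }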
    
        if (!(dwAttrib & FILE_ATTRIBUTE_DIRECTORY))
            printf("valid!\n");
        else
            printf("invalid!\n");
    
        return 0;
    }
    

    Make sure this source file is saved with a byte-order mark (BOM), even if you're using UTF-8. Windows applications, including Visual Studio and the compilers, can be very picky about that. If your editor won't do that, open the file in Visual Studio, choose Save As, click the down arrow next to the Save button, and choose Save with Encoding. In the Advanced Save Options dialog, choose "Unicode (UTF-8 with signature) - Codepage 65001".

    Make sure you have a file named fooß.txt in the current folder. I strongly recommend using a GUI program to create this file, like Notepad or Explorer.

    This program works. If you still get a file-not-found error, check that fooß.txt really is in the working directory, or change the program to use an absolute path. If you use an absolute path, use backslashes and make sure each one is properly escaped. Also check for typos, the file extension, and so on.
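
    As a small illustration of the escaping point, here is a sketch that builds the same absolute path two ways. The C:\temp location and the function names are made up for the example. It also spells ß as a universal character name, which keeps the literal independent of how the source file happens to be saved.

    ```cpp
    #include <string>

    // Hypothetical location, for illustration only.
    // In an ordinary literal every backslash has to be doubled.
    std::wstring escaped_path() {
        return L"C:\\temp\\foo\u00DF.txt";
    }

    // A raw string literal needs no doubling, but escapes are not
    // interpreted inside it, so the sharp s is appended from a normal literal.
    std::wstring raw_path() {
        return std::wstring(LR"(C:\temp\foo)") + L"\u00DF.txt";
    }
    ```

    Both functions return the same string, so either form can be passed to GetFileAttributesW via c_str().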

    Now, if you take the file name from standard input:

        std::wstring file_path;
        std::getline(std::wcin, file_path);
    

    And you enter fooß.txt in the console window, you'll probably find that it doesn't work. And if you look in the debugger, you'll see that the character that should be ß is something else. For me, it's á, but it might be different for you if your console codepage is something else.

    ß is U+00DF in Unicode. In Windows-1252 (the most common codepage for Windows users in the U.S.), it's 0xDF, so it might seem like there's no chance of a conversion problem. But console windows use OEM codepages by default. In the U.S., the common OEM codepage is 437. So when I type ß in the console, it's actually encoded as 0xE1. Surprise! That's the same as the Unicode value for á. And if you manage to enter a character with the value 0xDF, you'll see that it corresponds to the block character you reported in the original question.
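
    The mismatch can be reproduced without a console. The sketch below is only an illustration: both function names are invented, the table covers just the two codepage 437 bytes mentioned above, and the "naive" function models the symptom rather than making any claim about how std::wcin is implemented. On Windows, a real conversion would normally go through MultiByteToWideChar with the console's codepage.

    ```cpp
    #include <string>

    // Treats every input byte as if it were already a Unicode code point.
    // This reproduces the symptom: byte 0xE1 turns into U+00E1 (a acute).
    std::wstring widen_naively(const std::string& bytes) {
        std::wstring out;
        for (unsigned char b : bytes) out.push_back(static_cast<wchar_t>(b));
        return out;
    }

    // Decodes a tiny subset of codepage 437; ASCII bytes pass through.
    // A real program would need the full table.
    std::wstring decode_cp437_subset(const std::string& bytes) {
        std::wstring out;
        for (unsigned char b : bytes) {
            switch (b) {
                case 0xE1: out.push_back(L'\u00DF'); break;  // sharp s
                case 0xDF: out.push_back(L'\u2580'); break;  // upper half block
                default:   out.push_back(static_cast<wchar_t>(b)); break;
            }
        }
        return out;
    }
    ```

    Feeding the bytes "foo\xE1.txt" to the first function gives a name containing á, while the second gives the intended fooß.txt.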

    You would think (well, I would think) that asking for the input from std::wcin would do whatever conversion is necessary. But it doesn't, and there's probably some legacy backward compatibility reason for that. You could try to imbue the stream with the "proper" codepage, but that gets complicated, and I've never bothered trying to make it work. I've simply stopped trying to use anything other than ASCII on the console.
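
    For completeness, one route that sidesteps the codepage conversion is to read UTF-16 directly from the console with ReadConsoleW instead of going through std::wcin. This is a sketch and not something the approach above relies on: the helper names are made up, the 512-character buffer is an arbitrary choice, and it assumes standard input is a real console rather than a redirected file or pipe.

    ```cpp
    #include <string>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Hypothetical helper: drops trailing CR/LF characters from a line.
    std::wstring trim_line_ending(std::wstring line) {
        while (!line.empty() && (line.back() == L'\r' || line.back() == L'\n'))
            line.pop_back();
        return line;
    }

    #ifdef _WIN32
    // Hypothetical helper: reads one line of UTF-16 text from the console.
    // Returns an empty string if the call fails, e.g. when stdin is redirected.
    std::wstring read_console_line() {
        wchar_t buffer[512];  // arbitrary size chosen for this sketch
        DWORD chars_read = 0;
        HANDLE input = GetStdHandle(STD_INPUT_HANDLE);
        if (!ReadConsoleW(input, buffer, 512, &chars_read, nullptr))
            return std::wstring();
        return trim_line_ending(std::wstring(buffer, chars_read));
    }
    #endif
    ```

    The returned string could then be passed to GetFileAttributesW in place of the std::getline result. Whether the console font can display the characters you type is a separate matter.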