Tags: c++, filesystems, c++17, sfml

Passing unicode sf::String into std::filesystem::u8path


I am trying to pass an sf::String to std::filesystem::u8path. My first approach was to convert it to a std::string with (std::string)sfstringbar, but the result treats each character as a single byte. I also tried auto x = sfstringbar.toUtf8(); std::string(x.begin(), x.end()), with the same result. My second approach was to pass it as a char array, hoping it would be read as UTF-8, but the same thing happens.

EDIT:

char* makeutf8str(str string) {
    std::basic_string<sf::Uint8> utf8 = string.toUtf8();
    std::vector<char>* out = new std::vector<char>;
    for (auto x = utf8.begin(); x != utf8.end(); x++) {
        out->push_back(*x);
    }
    return &(out->at(0));
}

bool neaxfile::isfile(str file) {
    std::cout << "\nThis: " << makeutf8str(file) << "\n";
    return std::filesystem::is_regular_file(std::filesystem::u8path(makeutf8str(file)));
}

Here is the second approach I tried. As an example, I have a file called Яyes.txt, but when I pass its name in to check whether it exists, the function says it doesn't, because makeutf8str() splits Я into Ð and ¯. I can't seem to get the encoding to work properly.

EDIT 2:

str neaxfile::getcwd() {
    std::error_code ec;
    str path = std::filesystem::current_path(ec).u8string();
    if (ec.value() == 0) {
        return path;
    } else {
        return '\0';
    }
}

std::vector<str> neaxfile::listfiles() {
    std::vector<str> res;
    for (auto entry : std::filesystem::directory_iterator((std::string)neaxfile::getcwd())) {
        if (neaxfile::isfile(entry.path().wstring())) res.push_back(entry.path().wstring());
    }
    return res;
}

I tried the first solution below. It no longer prints Я, but it still doesn't confirm that this is a file. I tried to list the files using the code above.


Solution

  • std::filesystem::u8path() "Constructs a path p from a UTF-8 encoded sequence of chars [or char8_ts (since C++20)], supplied either as an std::string, or as std::string_view, or as a null-terminated multibyte string, or as a [first, last) iterator pair."

    A std::string can hold a UTF-8 encoded char sequence (better to use std::u8string in C++20, though). sf::String::toUtf8() returns a UTF-8 encoded std::basic_string<Uint8>. You can simply cast the Uint8 data to char to construct a std::string. There is no need for your makeutf8str() function to use std::vector<char> or return a raw char* at all, especially since it leaks the std::vector and the buffer it returns is not null-terminated, which both operator<< and the char* overload of u8path() require.

    You can use the std::string constructor that takes a char* and a size_t as input, eg:

    std::string makeutf8str(const str &string) {
        auto utf8 = string.toUtf8();
        return std::string(reinterpret_cast<const char*>(utf8.c_str()), utf8.size());
    }
    

    Or, you can use the std::string constructor that takes a range of iterators as input (despite your claim, this should work just fine), eg:

    std::string makeutf8str(const str &string) {
        auto utf8 = string.toUtf8();
        return std::string(utf8.begin(), utf8.end());
    }
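
As a rough check that the two constructors agree, here is a minimal SFML-free sketch. The Utf8Bytes alias is a hypothetical stand-in for the std::basic_string<sf::Uint8> that toUtf8() returns, and fromRange() and fromPointer() are names made up for this illustration:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the byte string returned by sf::String::toUtf8();
// a plain vector of bytes keeps this sketch independent of SFML.
using Utf8Bytes = std::vector<std::uint8_t>;

// Iterator-range constructor: every byte is converted to one char.
std::string fromRange(const Utf8Bytes &bytes) {
    return std::string(bytes.begin(), bytes.end());
}

// Pointer-plus-length constructor: the same buffer is reinterpreted as char.
std::string fromPointer(const Utf8Bytes &bytes) {
    return std::string(reinterpret_cast<const char*>(bytes.data()), bytes.size());
}
```

Both functions copy the bytes unchanged, so a two-byte sequence such as 0xD0 0xAF stays a two-byte sequence in the resulting std::string.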
    

    Either way will work fine with std::cout and std::filesystem::u8path(), eg:

    bool neaxfile::isfile(const str &file) {
        auto utf8 = makeutf8str(file);
        std::cout << "\nThis: " << utf8 << "\n";
        return std::filesystem::is_regular_file(std::filesystem::u8path(utf8));
    }
    

    That being said, the Unicode character Я is encoded in UTF-8 as the bytes 0xD0 0xAF, which appear as Ð¯ when interpreted as Latin-1 instead of UTF-8. This means the std::string data is properly UTF-8 encoded; it is just not being processed correctly. For instance, if your console cannot handle UTF-8 output, you will see Ð¯ instead of Я. u8path(), however, should process the UTF-8 encoded std::string just fine and convert it to the filesystem's native encoding as needed. There is still no guarantee that the underlying filesystem will actually handle a Unicode filename like Яyes.txt properly, but that would be an OS issue, not a C++ issue.
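
To make the mojibake concrete, the following sketch re-encodes each byte as if it were a Latin-1 code point, which is roughly what a Latin-1 console does when it displays the data. The helper name latin1ToUtf8() is hypothetical and only serves this illustration:

```cpp
#include <string>

// Treat every input byte as a Latin-1 code point (U+0000..U+00FF) and
// encode that code point as UTF-8. Bytes below 0x80 are plain ASCII and
// pass through; the rest become two-byte UTF-8 sequences.
std::string latin1ToUtf8(const std::string &bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the correct UTF-8 bytes of Я (0xD0 0xAF) yields the UTF-8 encoding of the two characters Ð (U+00D0) and ¯ (U+00AF), which matches the garbled output described in the question.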


    UPDATE: your listfiles() function is not making use of UTF-8 at all when using directory_iterator. It type-casts the sf::String from getcwd() to an ANSI encoded std::string (which is a lossy conversion), not to a UTF-8 encoded std::string. Worse, getcwd() constructs that sf::String from a UTF-8 encoded std::string, but the std::string constructor of sf::String expects ANSI by default, not UTF-8 (to fix that, you have to give it a UTF-8 std::locale). So, you are passing through several lossy conversions trying to get a string from the std::filesystem::path returned from std::filesystem::current_path to std::filesystem::directory_iterator.

    sf::String can convert to/from std::wstring, which std::filesystem::path can also use, so there is no need to go through UTF-8 and std::filesystem::u8path() at all, at least on Windows, where std::wstring uses UTF-16 and the underlying filesystem APIs also use UTF-16.

    Try this instead:

    bool neaxfile::isfile(const str &file) {
        std::wstring wstr = file;
        std::wcout << L"\nThis: " << wstr << L"\n";
        return std::filesystem::is_regular_file(std::filesystem::path(wstr));
    }
    
    str neaxfile::getcwd() {
        std::error_code ec;
        str path = std::filesystem::current_path(ec).wstring();
        if (ec.value() == 0) {
            return path;
        } else {
            return L"";
        }
    }
    
    std::vector<str> neaxfile::listfiles() {
        std::vector<str> res;
        std::filesystem::path cwdpath(neaxfile::getcwd().toWideString());
        for (auto entry : std::filesystem::directory_iterator(cwdpath)) {
            str filepath = entry.path().wstring();
            if (neaxfile::isfile(filepath)) res.push_back(filepath);
        }
        return res;
    }
    

    If you really want to use UTF-8 for conversions between C++ strings and SFML strings, then try this instead to avoid any data loss:

    std::string makeutf8str(const str &string) {
        auto utf8 = string.toUtf8();
        return std::string(reinterpret_cast<const char*>(utf8.c_str()), utf8.size());
    }
    
    str fromutf8str(const std::string &string) {
        return str::fromUtf8(string.begin(), string.end());
    }
    
    bool neaxfile::isfile(const str &file) {
        auto utf8 = makeutf8str(file);
        std::cout << "\nThis: " << utf8 << "\n";
        return std::filesystem::is_regular_file(std::filesystem::u8path(utf8));
    }
    
    str neaxfile::getcwd() {
        std::error_code ec;
        auto path = std::filesystem::current_path(ec).u8string();
        if (ec.value() == 0) {
            return fromutf8str(path);
        } else {
            return "";
        }
    }
    
    std::vector<str> neaxfile::listfiles() {
        std::vector<str> res;
        auto cwdpath = std::filesystem::u8path(makeutf8str(neaxfile::getcwd()));
        for (auto entry : std::filesystem::directory_iterator(cwdpath)) {
            str filepath = fromutf8str(entry.path().u8string());
            if (neaxfile::isfile(filepath)) res.push_back(filepath);
        }
        return res;
    }
    

    That being said, you are doing a lot of unnecessary conversions between C++ strings and SFML strings. You really shouldn't be using SFML strings when you are not directly interacting with SFML's API. You really should be using C++ strings as much as possible, especially with the <filesystem> API, eg:

    bool neaxfile::isfile(const std::string &file) {
        std::cout << "\nThis: " << file << "\n";
        return std::filesystem::is_regular_file(std::filesystem::u8path(file));
    }
    
    std::string neaxfile::getcwd() {
        std::error_code ec;
        std::string path = std::filesystem::current_path(ec).u8string();
        if (ec.value() == 0) {
            return path;
        } else {
            return "";
        }
    }
    
    std::vector<std::string> neaxfile::listfiles() {
        std::vector<std::string> res;
        auto cwdpath = std::filesystem::u8path(neaxfile::getcwd());
        for (auto entry : std::filesystem::directory_iterator(cwdpath)) {
            auto filepath = entry.path().u8string();
            if (neaxfile::isfile(filepath)) res.push_back(filepath);
        }
        return res;
    }
    

    Alternatively:

    bool neaxfile::isfile(const std::wstring &file) {
        std::wcout << L"\nThis: " << file << L"\n";
        return std::filesystem::is_regular_file(std::filesystem::path(file));
    }
    
    std::wstring neaxfile::getcwd() {
        std::error_code ec;
        auto path = std::filesystem::current_path(ec).wstring();
        if (ec.value() == 0) {
            return path;
        } else {
            return L"";
        }
    }
    
    std::vector<std::wstring> neaxfile::listfiles() {
        std::vector<std::wstring> res;
        std::filesystem::path cwdpath(neaxfile::getcwd());
        for (auto entry : std::filesystem::directory_iterator(cwdpath)) {
            auto filepath = entry.path().wstring();
            if (neaxfile::isfile(filepath)) res.push_back(filepath);
        }
        return res;
    }
    

    A better option is to simply not pass around strings at all. std::filesystem::path is an abstraction to help shield you from that, eg:

    bool neaxfile::isfile(const std::filesystem::path &file) {
        std::wcout << L"\nThis: " << file.wstring() << L"\n";
        return std::filesystem::is_regular_file(file);
    }
    
    std::filesystem::path neaxfile::getcwd() {
        std::error_code ec;
        auto path = std::filesystem::current_path(ec);
        if (ec.value() == 0) {
            return path;
        } else {
            return {};
        }
    }
    
    std::vector<std::filesystem::path> neaxfile::listfiles() {
        std::vector<std::filesystem::path> res;
        for (auto entry : std::filesystem::directory_iterator(neaxfile::getcwd())) {
            auto filepath = entry.path();
            if (neaxfile::isfile(filepath)) res.push_back(filepath);
        }
        return res;
    }
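
The path-only approach can be tried end to end with a scratch directory. The sketch below is an illustration under assumptions, not part of the original code: the function names listRegularFiles() and makeScratchDir() and the directory name u8path_demo are made up, and it assumes the temporary directory's filesystem accepts non-ASCII file names:

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// List the regular files in dir, keeping every name as a path object so
// that no string conversion happens along the way.
std::vector<fs::path> listRegularFiles(const fs::path &dir) {
    std::vector<fs::path> res;
    std::error_code ec;
    for (const auto &entry : fs::directory_iterator(dir, ec)) {
        if (entry.is_regular_file(ec)) res.push_back(entry.path());
    }
    return res;
}

// Create a scratch directory (hypothetical name) that contains a single
// file whose name starts with the UTF-8 bytes of the Cyrillic letter Ya.
fs::path makeScratchDir() {
    fs::path dir = fs::temp_directory_path() / "u8path_demo";
    fs::remove_all(dir);
    fs::create_directories(dir);
    std::ofstream(dir / fs::u8path("\xD0\xAF" "yes.txt")) << "hello";
    return dir;
}
```

Calling listRegularFiles(makeScratchDir()) should return one path whose filename() compares equal to fs::u8path("\xD0\xAF" "yes.txt"). Note that u8path() is deprecated in C++20, where a path can be constructed from a char8_t string instead.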