Reading ASCII text file using an std::ifstream in C++

I have an Arabic file (ASCII), which contains: 121101 الزبون كمال 121102 الزبون سعيد 121103 الزبون عمار

I want to read this file using an std::ifstream in C++ as:

std::ifstream ifs(file.GetFileName());
std::string content((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

When I inspected the content variable in the VS IDE debugger, the text was decoded incorrectly: 121101 ÇáÒÈæä ßãÇá 121102 ÇáÒÈæä ÓÚíÏ 121103 ÇáÒÈæä ÚãÇÑ

I have also tried an std::wifstream:

std::wifstream ifs2(file.GetFileName());
std::string content2((std::istreambuf_iterator<wchar_t>(ifs2)), std::istreambuf_iterator<wchar_t>());

I got the same result. Could anybody help me? Thanks.

Solution

I have an Arabic file (ASCII), which contains: 121101 الزبون كمال 121102 الزبون سعيد 121103 الزبون عمار

After some clarification, OP wants:

to write a general function which reads UTF-8 and ANSI files

To be able to treat the content in the same way regardless of the source encoding, I suggest converting to a UTF-16 encoded std::wstring. OP seems to be developing for the Windows platform, where UTF-16 is the encoding expected by most APIs. On other platforms (Linux) it might be more suitable to convert everything to UTF-8 instead.

Reading ANSI text file into UTF-16 encoded wstring

To be able to decode ANSI (aka extended ASCII), we have to know the codepage of the file.

The codepage (or more precisely the locale) can be defined via the stream's imbue() method. In your case the code page is 1256.

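The following example reads the content of a text file that is encoded with ANSI codepage 1256 and displays the text using MessageBoxW(), which expects a UTF-16 encoded string:

#include <fstream>
#include <iterator>  // std::istreambuf_iterator
#include <locale>    // std::locale
#include <string>
#include <codecvt>   // only needed for the UTF-8 variant further below
#include <Windows.h>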

int main()
{
    // Use wifstream because we want to read content into a wstring.
    std::wifstream f{"test.txt"};

    // Define the code page of the text file (1256 = Arabic)
    f.imbue( std::locale( ".1256" ) );

    // Read the whole file into a wstring.
    // The stream converts from ANSI to UTF-16 encoding.
    std::wstring s{ std::istreambuf_iterator<wchar_t>( f ), std::istreambuf_iterator<wchar_t>() };

    // Display the string which is now UTF-16 encoded.
    ::MessageBoxW( NULL, s.c_str(), L"test", 0 );

    return 0;
}

NOTE: The std::locale parameter is platform specific. ".1256" works on the Windows platform, but it would probably not work on Linux, for instance.
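Because the locale name is platform specific, it can help to check whether it is available before imbuing the stream. The std::locale constructor throws std::runtime_error when the name is not supported. The following is only a minimal sketch; the helper name localeIsAvailable is hypothetical and not part of any library:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if the C++ runtime can construct
// a locale with the given name (e.g. ".1256" on Windows).
bool localeIsAvailable(const std::string& name)
{
    try {
        // The constructor throws std::runtime_error for unsupported names.
        static_cast<void>(std::locale{ name });
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

With such a check, the program could report a clear error instead of terminating with an unhandled exception when ".1256" is not supported.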

Reading UTF-8 encoded text file into UTF-16 encoded wstring

For this we can employ the std::codecvt_utf8_utf16 facet. Replace the imbue() call of the previous example with the following code:

    f.imbue( std::locale( f.getloc(), 
        new std::codecvt_utf8_utf16< wchar_t, 1114111UL, std::consume_header> ) );

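The flag std::consume_header skips the byte order mark if it exists. Be aware that std::codecvt_utf8_utf16 and the <codecvt> header are deprecated since C++17, although compilers such as VS2017 still provide them.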

Notes:

The code samples have been tested with VS2017 under Windows 10 with German localization.
For brevity I omitted error handling. Stream state should be checked after opening and after reading from the stream.
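As a rough illustration of the error handling mentioned above, the reading code could be wrapped in a function like the one below. This is a sketch under assumptions: the name readAllText is hypothetical, errors are reported by throwing std::runtime_error (one possible choice), and the file is read character by character with get() so that the stream's state flags are actually updated. How a decoding failure is reported may differ between standard library implementations, so that part should be verified on the target platform.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole text file into a wstring,
// decoding it with the given locale, and checks the stream state.
std::wstring readAllText(const std::string& path, const std::locale& loc)
{
    std::wifstream f{ path };
    if (!f.is_open())
        throw std::runtime_error("cannot open file: " + path);

    // Must be called before the first read.
    f.imbue(loc);

    std::wstring text;
    wchar_t ch;
    // get() fails at end of file (eofbit/failbit) or on a read error.
    while (f.get(ch))
        text.push_back(ch);

    // badbit signals an error reported by the underlying stream buffer.
    if (f.bad())
        throw std::runtime_error("error while reading file: " + path);

    return text;
}
```

On Windows, the function could be called with std::locale(".1256") for the ANSI file, or with a locale that carries the std::codecvt_utf8_utf16 facet for the UTF-8 file.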

Creating a generic solution

The code samples presented above require you to know the encoding of the text files beforehand. Detecting the encoding of a text file in a truly generic way is a hard task because there is no standard way of doing it. It cannot be done reliably, so you would have to use some heuristics.

If you can make some assumptions about the files you have to handle, you can write a simple detection function though. Say the files fall only into the following categories:

ANSI encoded file with code page 1256
UTF-8 encoded file with BOM (byte order mark)

Then you could read the first 3 bytes of the file using std::ifstream (opened in binary mode) and compare them with {0xEF, 0xBB, 0xBF}. If they are equal, you can be relatively sure that the file is UTF-8 encoded, because it is unlikely that a non-UTF-8 encoded file begins with these bytes. If they are not equal, you would assume code page 1256.
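The detection described above might be sketched as follows. The enum TextEncoding and the function detectEncoding are hypothetical names introduced for illustration, and the sketch assumes that only the two file categories listed above occur:

```cpp
#include <fstream>
#include <string>

// Hypothetical result type covering only the two assumed categories.
enum class TextEncoding { Utf8WithBom, Cp1256 };

// Hypothetical helper: looks at the first 3 bytes of the file and
// compares them with the UTF-8 byte order mark EF BB BF.
TextEncoding detectEncoding(const std::string& path)
{
    // Binary mode so that the bytes are read unmodified.
    std::ifstream in{ path, std::ios::binary };

    char head[3] = {};
    in.read(head, 3);

    // gcount() is less than 3 if the file is shorter than the BOM
    // or could not be opened; both cases fall back to code page 1256.
    const bool hasBom = in.gcount() == 3
        && static_cast<unsigned char>(head[0]) == 0xEF
        && static_cast<unsigned char>(head[1]) == 0xBB
        && static_cast<unsigned char>(head[2]) == 0xBF;

    return hasBom ? TextEncoding::Utf8WithBom : TextEncoding::Cp1256;
}
```

Depending on the result, you would then open the file with std::wifstream and imbue either the std::codecvt_utf8_utf16 facet or the ".1256" locale, as shown in the earlier examples.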