I have an Arabic file (ASCII), which contains: 121101 الزبون كمال 121102 الزبون سعيد 121103 الزبون عمار
I want to read this file using an std::ifstream in C++ as:
std::ifstream ifs(file.GetFileName());
std::string content((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
When I inspected the content variable in the Visual Studio debugger, the characters were garbled: 121101 ÇáÒÈæä ßãÇá 121102 ÇáÒÈæä ÓÚíÏ 121103 ÇáÒÈæä ÚãÇÑ
I have also tried an std::wifstream:
std::wifstream ifs2(file.GetFileName());
std::string content2((std::istreambuf_iterator<wchar_t>(ifs2)), std::istreambuf_iterator<wchar_t>());
I got the same error. Could anybody help me? Thanks.
After some clarification, OP wants:
to write a general function that reads UTF-8 and ANSI files
To be able to treat the content in the same way, I suggest converting to a UTF-16 encoded std::wstring. OP seems to be developing for the Windows platform, where UTF-16 is the encoding expected by most APIs. On other platforms (Linux) it might be more suitable to convert everything to UTF-8 instead.
To be able to decode ANSI (aka extended ASCII), we have to know the codepage of the file.
The codepage (or more precisely the locale) can be defined via the stream's imbue() method. In your case the code page is 1256.
The following example reads the content of a text file that is encoded with ANSI codepage 1256 and displays the text using MessageBoxW(), which expects a UTF-16 encoded string:
#include <fstream>
#include <iterator>
#include <locale>
#include <string>
#include <codecvt>
#include <Windows.h>
int main()
{
    // Use wifstream because we want to read content into a wstring.
    std::wifstream f{"test.txt"};
    // Define the code page of the text file (1256 = Arabic).
    f.imbue( std::locale( ".1256" ) );
    // Read the whole file into a wstring.
    // The stream converts from ANSI to UTF-16 encoding.
    std::wstring s{ std::istreambuf_iterator<wchar_t>( f ), std::istreambuf_iterator<wchar_t>() };
    // Display the string which is now UTF-16 encoded.
    ::MessageBoxW( NULL, s.c_str(), L"test", 0 );
    return 0;
}
NOTE: The std::locale parameter is platform specific. ".1256" works on the Windows platform, but it would probably not work on Linux, for instance.
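Because the locale name is platform specific, it may help to probe for it before imbuing the stream. The following is only a minimal sketch: locale_available is a hypothetical helper name, not something from the standard library. It relies on the fact that the std::locale constructor throws std::runtime_error when the name is not recognised.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper (an illustration, not part of any library):
// reports whether the C++ runtime accepts the given locale name.
bool locale_available(const char* name)
{
    try {
        // std::locale(const char*) throws std::runtime_error
        // if the name is not valid on this platform.
        std::locale loc(name);
        (void)loc;
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

A caller could check locale_available(".1256") first and fall back to another conversion strategy if it returns false. Which names are accepted on a non-Windows system depends on the locales installed there.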
To read UTF-8 encoded files, we can employ the std::codecvt_utf8_utf16 facet.
Replace the imbue() call of the previous example with the following code:
f.imbue( std::locale( f.getloc(),
new std::codecvt_utf8_utf16< wchar_t, 1114111UL, std::consume_header> ) );
The flag std::consume_header skips the byte order mark if it exists.
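Put together, the UTF-8 variant might look like the sketch below. The function name read_utf8_file is hypothetical, and the sketch imbues the stream before opening the file, which is an assumption made here for robustness rather than something the example above requires. Note that the <codecvt> header has been deprecated since C++17, so compilers may emit deprecation warnings.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads a UTF-8 encoded file into a std::wstring
// holding UTF-16 code units. Returns an empty string if the file
// cannot be opened.
std::wstring read_utf8_file(const std::string& path)
{
    std::wifstream f;
    // Install the UTF-8 -> UTF-16 facet; consume_header skips a leading BOM.
    f.imbue(std::locale(f.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    // Binary mode keeps the runtime from translating line endings.
    f.open(path, std::ios::binary);
    // Read the whole file; the facet decodes it on the fly.
    return std::wstring(std::istreambuf_iterator<wchar_t>(f),
                        std::istreambuf_iterator<wchar_t>());
}
```

On Windows the result can be passed directly to wide-character APIs such as MessageBoxW() via s.c_str().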
Notes:
The code samples presented above require you to know the encoding of the text files beforehand. Detecting the encoding of a text file in a truly generic way is a hard task because there is no standard way of doing that. It cannot be done reliably, so you would have to fall back on heuristics.
If you can make some assumptions about the files you have to handle, you can write a simple detection function though. Say the files fall into only two categories: UTF-8 with a byte order mark (BOM), or ANSI with code page 1256.
Then you could read the first 3 bytes of the file using std::ifstream and compare them with {0xEF, 0xBB, 0xBF}. If they are equal, you can be relatively sure that the file is UTF-8 encoded, because a non-UTF-8 file is unlikely to begin with these bytes. If they are not equal, you would assume code page 1256.
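The BOM check described above could be sketched as follows. The function name starts_with_utf8_bom is hypothetical, and the sketch covers only the detection step; choosing between the UTF-8 facet and the code page 1256 locale based on its result is left to the caller.

```cpp
#include <array>
#include <fstream>
#include <string>

// Hypothetical helper: returns true if the file begins with the
// UTF-8 byte order mark EF BB BF, false otherwise (including when
// the file cannot be opened or is shorter than 3 bytes).
bool starts_with_utf8_bom(const std::string& path)
{
    // Binary mode so the raw bytes are read without any translation.
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return false;

    std::array<char, 3> head{};
    in.read(head.data(), static_cast<std::streamsize>(head.size()));
    if (in.gcount() != 3)
        return false;   // fewer than 3 bytes available

    // Compare as unsigned values because char may be signed.
    return static_cast<unsigned char>(head[0]) == 0xEF
        && static_cast<unsigned char>(head[1]) == 0xBB
        && static_cast<unsigned char>(head[2]) == 0xBF;
}
```

This heuristic misses UTF-8 files saved without a BOM, which would then be decoded as code page 1256, so it only fits the two-category assumption stated above.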