c++ character-encoding mfc windows-1252

Why does my Unicode-enabled software not recognise 'Š' and other characters in ANSI files, and how do I fix it?


I have an MFC project that reads from and writes to ANSI files. The Character Set of the application is set to Unicode.

Addendum
I cannot change the encoding of either the input file or the output file, because this is a converter between legacy programs. The expected character encoding is actually windows-1252.

When reading and writing some files, I noticed that some rarely used characters such as Š (0x8A) are replaced by ? (0x3F) when I read and write them with CStdioFile. To see which characters are affected, I created an ANSI-encoded test file containing the characters from 0x30 to 0xFF.

[Image: input file structure as interpreted by Beyond Compare]

And the resultant file looked like this:

[Image: output file structure as interpreted by Beyond Compare]

The changed characters all lie in the same region, from 0x80 up to 0x9F, and are all replaced by 0x3F ('?'). Strangely enough, there are some exceptions, such as 0x81, 0x8D, 0x90 and 0x9D, which were not affected.

Example code to test the behaviour:

// prepare variables
CFileException fileException;
CStdioFile filei;
CStdioFile fileo;
CString strText;

// open input file (ANSI, text mode)
if (!filei.Open(TEXT("test.txt"), CFile::modeRead | CFile::shareExclusive | CFile::typeText, &fileException))
    return; // failed to open input file

// open output file
if (!fileo.Open(TEXT("testout.txt"), CFile::modeCreate | CFile::modeWrite | CFile::shareExclusive | CFile::typeText, &fileException))
    return; // failed to open output file

// read one line, convert it to narrow characters and write the bytes
BOOL eof = filei.ReadString(strText) <= 0;
CStringA strTextA(strText);
fileo.Write(strTextA, strTextA.GetLength());

// clean up
filei.Close();
fileo.Close();

Why does it do that and what would I need to do to preserve all characters?

Disabling Unicode mode would fix the issue, but unfortunately that is not an option in my case.


Summary
Here's an extract of the things that were useful for me from the accepted answer:

Don't convert from CStringW to CStringA by just calling its constructor. When converting from Unicode to "ANSI" (Windows-1252), use CW2A:

CStringA strTextA(CW2A(strText, CP_ACP)); // CP_ACP selects the system ANSI code page
fileo.Write(strTextA, strTextA.GetLength());    

Even simpler: use the CStdioFile::WriteString method instead of CStdioFile::Write:

fileo.Open(TEXT("testout.txt"), CFile::modeCreate | CFile::modeWrite | CFile::shareExclusive | CFile::typeText, &fileException);
fileo.WriteString(strText);
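
The pattern reported in the question (bytes 0x80 to 0x9F turning into '?', except a few such as 0x81, 0x8D, 0x90 and 0x9D) is consistent with the bytes being widened one-to-one on read and then encoded as Windows-1252 on write. That reading is an assumption about what ReadString does in a Unicode build, not something confirmed above. The portable sketch below models both paths; the helper names (`decodeCp1252`, `encodeCp1252`, `roundTripNaive`, `roundTripCp1252`) are hypothetical and are not MFC or Windows APIs.

```cpp
#include <cassert>
#include <cstdint>

// Windows-1252 code points for bytes 0x80..0x9F. A 0 entry marks one of the
// five bytes the code page leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178};

// Decode one Windows-1252 byte to a Unicode code point. Undefined bytes are
// passed through as the C1 control with the same value.
std::uint16_t decodeCp1252(std::uint8_t b) {
    if (b < 0x80 || b > 0x9F) return b;  // same as Latin-1 outside 0x80..0x9F
    const std::uint16_t cp = kCp1252High[b - 0x80];
    return cp != 0 ? cp : b;
}

// Encode one code point as Windows-1252, substituting '?' when unmappable.
std::uint8_t encodeCp1252(std::uint16_t cp) {
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return static_cast<std::uint8_t>(cp);
    for (int i = 0; i < 32; ++i)
        if (kCp1252High[i] == cp) return static_cast<std::uint8_t>(0x80 + i);
    // A C1 control maps back only if its slot is undefined in the code page.
    if (cp >= 0x80 && cp <= 0x9F && kCp1252High[cp - 0x80] == 0)
        return static_cast<std::uint8_t>(cp);
    return '?';
}

// Assumed faulty path: the byte is widened unchanged, then encoded.
std::uint8_t roundTripNaive(std::uint8_t b) { return encodeCp1252(b); }

// Intended path: decode with the Windows-1252 table first, then encode.
std::uint8_t roundTripCp1252(std::uint8_t b) {
    return encodeCp1252(decodeCp1252(b));
}
```

In this model `roundTripNaive(0x8A)` yields 0x3F while `roundTripNaive(0x81)` yields 0x81, which matches the observed exceptions, and `roundTripCp1252` returns every byte unchanged. On Windows the same idea would normally be expressed with the system conversion functions and an explicit code page of 1252 rather than a hand-written table.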

Solution

  • The problem is that a CStdioFile opened with the CStdioFile::Open method can, by default, only read and write ANSI files. If you open the file stream yourself, you can specify the correct encoding:

    FILE* fStream = NULL;
    errno_t e = _tfopen_s(&fStream, _T("C:\\Files\\test.txt"), _T("rt,ccs=UNICODE"));
    if (e != 0) 
        return; // failed to open file 
    CStdioFile f(fStream);  
    CString sRead;
    f.ReadString(sRead);
    f.Close();
    

    If you'd like to write to a file, use the _T("wt,ccs=UNICODE") set of options instead.

    The other obvious problem is that you call Write instead of WriteString. With WriteString there is no need to convert CStringW to CStringA. If you have to use Write for some reason, you need to convert CStringW to CStringA properly by calling CW2A() with CP_UTF8.

    Here is sample code that uses the general-purpose CFile class and Write instead of CStdioFile and WriteString:

    CStringW sText = L"Привет мир";
    
    CFile file(_T("C:\\Files\\test.txt"), CFile::modeWrite | CFile::modeCreate);
    
    CStringA sUTF8 = CW2A(sText, CP_UTF8);
    file.Write(sUTF8 , sUTF8.GetLength());
    

    Please keep in mind that both the CFile constructor that opens a file and the Write method throw exceptions of type CFileException, so you should handle them.

    Use the following options when opening text file streams to specify the type of encoding:

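    • "ccs=UNICODE" corresponds to UTF-16LE (little endian) with a BOM when writing a new file; when reading, the encoding is taken from the file's BOM if one is present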
    • "ccs=UTF-8" corresponds to UTF-8
    • "ccs=UTF-16LE" corresponds to UTF-16LE (Little endian)
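
Because these modes are tied to the byte order mark (BOM) at the start of a file, and a plain Windows-1252 file has none, it can be useful to check the first bytes before choosing how to open a file. Below is a minimal portable sketch of such a check; `BomKind` and `detectBom` are hypothetical names introduced for illustration, not part of MFC or the CRT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encodings that can be recognised from a leading byte order mark.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which BOM, if any, it starts with.
BomKind detectBom(const std::uint8_t* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BomKind::Utf8;     // EF BB BF
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BomKind::Utf16LE;  // FF FE
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BomKind::Utf16BE;  // FE FF
    return BomKind::None;         // e.g. an ANSI / Windows-1252 file
}
```

A caller could read the first three bytes of the file in binary mode, pass them to `detectBom`, and treat `BomKind::None` as "legacy single-byte file" for a converter like the one in the question.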