c++ unicode codeblocks chinese-locale extended-ascii

Strange ASCII response (Chinese characters) when trying to replicate strlwr in Code::Blocks 13.12

The following code gives a very strange result:

#include <iostream>
#include <fstream>

using namespace std;

ifstream f("f1.in");
ofstream g("f1.out");
char sir[255];
int i;

char strlwr(char sir[]) //if void nothing changes
{
    int i = 0;

    for (i = 0; sir[i] != NULL; i++) {
        sir[i] = tolower(sir[i]);
    }

    return 0;  // returning 1 instead of 0 kind of works, but strlwr(sir) still needs to be displayed
}

int main()
{
    f.get(sir, 255);
    g << sir << '\n'; // without '\n', strlwr no longer matters
    g << strlwr(sir);
    g << sir;
    return 0;
}

f1.in:

JHON HAS A COW

f1.out:

䡊乏䠠十䄠䌠坏 
桪湯栠獡愠挠睯

It shows this only when I am using just CAPS.
I am using Code::Blocks 13.12 on Ubuntu 14, European version.
I would be very interested in knowing why it shows this.
I am interested in knowing if it gives you the same thing.

Solution

Congratulations! You've discovered mojibake! Your output text is correct, but whatever you're viewing it with is interpreting the bytes as UTF-16 instead of ASCII.

If you convert the characters of the garbled output into their hexadecimal values, the issue becomes clear. (Code borrowed from this StackOverflow answer.)

$ cat unicode.txt
䡊乏䠠十䄠䌠坏
桪湯栠獡愠挠睯

$ cat unicode.txt | while IFS= read -r -d '' -n1 c; do printf "%02X\n" "'$c"; done
484A
4E4F
4820
5341
4120
4320
574F
0A
686A
6E6F
6820
7361
6120
6320
776F
0A

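The second command reads the file character by character and prints each character's numeric value in hex. Each character accounts for two bytes of data because the input is understood to be UTF-16, which stores each of these characters as one 2-byte unit. In little-endian order the low byte comes first, so 484A sits in the file as the bytes 4A 48.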

If you reinterpret the hex output as single-byte ASCII instead (swapping the two bytes of each value to correct for endianness), you can see that your program did work:

$ cat unicode.txt | while IFS= read -r -d '' -n1 c; do printf "%02X\n" "'$c"; done
484A ; JH
4E4F ; ON
4820 ;  H
5341 ; AS
4120 ;  A
4320 ;  C
574F ; OW
0A   ; \n
686A ; jh
6E6F ; on
6820 ;  h
7361 ; as
6120 ;  a
6320 ;  c
776F ; ow
0A   ; \n
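
The byte-splitting shown in the annotated listing can also be sketched in C++. This is only a minimal illustration, assuming the viewer decoded the file as little-endian UTF-16; the helper name utf16leToBytes is hypothetical and not part of any library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: split each 16-bit value into the two bytes it
// occupies in a UTF-16LE file (low byte first, then high byte).
std::string utf16leToBytes(const std::u16string& units) {
    std::string bytes;
    for (char16_t u : units) {
        bytes.push_back(static_cast<char>(u & 0xFF));         // low byte is stored first
        bytes.push_back(static_cast<char>((u >> 8) & 0xFF));  // high byte follows
    }
    // Sanity check: every 16-bit value contributes exactly two bytes.
    assert(bytes.size() == 2 * units.size());
    return bytes;
}
```

Feeding it the first two values from the listing, utf16leToBytes(u"\u484A\u4E4F"), yields the four bytes of "JHON". The lone 0A value splits into a newline followed by a zero byte.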

To determine whether the issue is your C++ program or your viewing program, run the command xxd f1.out. If the dump looks like ASCII, then it's your viewing program's fault. Otherwise, it's your program's fault and you should look into setlocale and/or opening your output file in binary mode.
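
If xxd is not at hand, a rough in-program counterpart is to print the bytes yourself. The sketch below is an assumption-laden illustration rather than a replacement for xxd: toHex is a hypothetical helper that renders bytes already read into a std::string as space-separated hex.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: render raw bytes as space-separated two-digit hex.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes above 0x7F do not become negative.
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    // Sanity check: two digits per byte plus one separator between bytes.
    assert(out.size() == (bytes.empty() ? 0 : bytes.size() * 3 - 1));
    return out;
}
```

Plain ASCII text shows up as values between 20 and 7e plus 0a for newlines; a 00 in the middle of such a dump would be a stray NUL byte.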

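Either way, you should probably change g<<strlwr(sir); to just strlwr(sir);. Currently it writes the function's return value, a NUL byte, to your output, which is probably unintended. A stray NUL byte in otherwise plain text may also be what leads a viewer to guess UTF-16.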
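
One way to rule out that stray byte is to make the function return nothing, so there is no value to stream by accident. A minimal sketch, assuming you are free to rename the function; toLowerInPlace is a hypothetical name chosen to avoid clashing with the non-standard strlwr some compilers ship.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical replacement for the question's strlwr: lowercases a
// NUL-terminated string in place and returns nothing.
void toLowerInPlace(char* s) {
    assert(s != nullptr);
    for (int i = 0; s[i] != '\0'; ++i) {
        // Cast to unsigned char first: std::tolower is undefined for
        // negative values other than EOF.
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    }
}
```

In the question's main, the calls would then read g << sir << '\n'; toLowerInPlace(sir); g << sir; so only the two lines of text reach f1.out.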