Tags: c++, unicode, codeblocks, chinese-locale, extended-ascii

Strange ASCII response (Chinese characters) when trying to replicate strlwr in Code::Blocks 13.12


The following code gives a very strange result:

#include <iostream>
#include <fstream>

using namespace std;

ifstream f("f1.in");
ofstream g("f1.out");
char sir[255];
int i;

char strlwr(char sir[]) //if void nothing changes
{
    int i = 0;

    for (i = 0; sir[i] != NULL; i++) {
        sir[i] = tolower(sir[i]);
    }

    return 0;  // returning 1 instead of 0 kind of works, but strlwr(sir) still has to be printed
}

int main()
{
    f.get(sir, 255);
    g << sir << '\n'; // without '\n', strlwr no longer matters
    g << strlwr(sir);
    g << sir;
    return 0;
}

f1.in:

JHON HAS A COW

f1.out:

䡊乏䠠十䄠䌠坏 
桪湯栠獡愠挠睯

This only happens when the input is entirely in capital letters.
I am using Code::Blocks 13.12 on Ubuntu 14, European version.
I would like to know why this happens, and whether you get the same result.


Solution

  • Congratulations! You've discovered mojibake! The letters your program wrote are correct, but whatever you're viewing the file with is interpreting its bytes as UTF-16-encoded Unicode.

    If you convert the characters of the garbled output into their hexadecimal code point values, the issue becomes clear. (Code borrowed from this Stack Overflow answer.)

    $ cat unicode.txt
    䡊乏䠠十䄠䌠坏
    桪湯栠獡愠挠睯
    
    $ cat unicode.txt | while IFS= read -r -d '' -n1 c; do printf "%02X\n" "'$c"; done
    484A
    4E4F
    4820
    5341
    4120
    4320
    574F
    0A
    686A
    6E6F
    6820
    7361
    6120
    6320
    776F
    0A
    

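    The second command reads the file character by character and prints each character's code point in hex. Each value spans two bytes because the viewer took the file to be UTF-16, which stores every one of these characters as a single 2-byte unit. In the little-endian form of UTF-16 the low byte comes first in the file, so the two bytes of each pair appear swapped in the printed value.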

    If you instead read each value as two single-byte ASCII characters (swapping the two bytes to correct for endianness), you can see that your program did work:

    $ cat unicode.txt | while IFS= read -r -d '' -n1 c; do printf "%02X\n" "'$c"; done
    484A ; JH
    4E4F ; ON
    4820 ;  H
    5341 ; AS
    4120 ;  A
    4320 ;  C
    574F ; OW
    0A   ; \n
    686A ; jh
    6E6F ; on
    6820 ;  h
    7361 ; as
    6120 ;  a
    6320 ;  c
    776F ; ow
    0A   ; \n
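
The pairing described above can be reproduced in a few lines of C++. This is only a sketch of what a UTF-16 little-endian decoder would do with the raw bytes; the function name as_utf16le_units is made up for illustration and is not part of the original program.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pairs up the bytes of `bytes` the way a UTF-16 little-endian decoder would:
// the first byte of each pair is the low half, the second byte the high half.
// A trailing unpaired byte is simply ignored in this sketch.
std::vector<std::uint16_t> as_utf16le_units(const std::string& bytes)
{
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto low  = static_cast<unsigned char>(bytes[i]);
        const auto high = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(low | (high << 8)));
    }
    return units;
}
```

For example, as_utf16le_units("JH") yields the single unit 0x484A, which is the first value in the dump, and the byte pair '\n' followed by a NUL byte yields 0x000A, an ordinary newline.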
    

    To determine whether the issue lies in your C++ program or in your viewing program, run the following command: xxd f1.out. If the bytes look like plain ASCII, then it's your viewing program's fault. Otherwise, it's your program's fault and you should look into setlocale and/or opening your output file in binary mode.
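
If xxd is not at hand, the same check can be sketched in C++. The helper names read_bytes and is_plain_ascii_text are hypothetical, and "plain ASCII text" is taken here to mean printable ASCII plus tab, newline and carriage return, which is an assumption rather than a formal definition.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file as raw bytes; binary mode avoids any newline translation.
// Returns an empty string if the file cannot be opened.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// True if every byte is printable ASCII (0x20..0x7E) or tab, newline,
// or carriage return. A NUL byte therefore makes the check fail.
bool is_plain_ascii_text(const std::string& bytes)
{
    for (unsigned char b : bytes) {
        const bool printable  = b >= 0x20 && b <= 0x7E;
        const bool whitespace = b == '\t' || b == '\n' || b == '\r';
        if (!printable && !whitespace) {
            return false;
        }
    }
    return true;
}
```

Calling is_plain_ascii_text(read_bytes("f1.out")) would then report whether the file holds anything other than ordinary ASCII text, such as a stray NUL byte.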

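    Either way, you should probably change g << strlwr(sir); to just strlwr(sir);. Because strlwr returns the char value 0, streaming its result currently writes a NUL byte to your output, which is probably unintended. That stray NUL byte may also be what leads your viewer to guess UTF-16.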