
Chinese Simplified to Hex GB2312 encoding in C#


I am having an issue converting a string containing Simplified Chinese to a double-byte encoding (GB2312). This is for printing Chinese characters on a Zebra printer.

The specs I am looking at show an example with the text of "冈区色呆" which they show as converting to a hex value of 38_54_47_78_49_2b_34_74.

In my C# code I am trying the conversion with the test below. Every byte of my result is off by 8 in the leading hex digit (a difference of 0x80). What am I missing here?

        private const string SimplifiedChineseChars = "冈区色呆";

        [TestMethod]
        public void GetBackCorrectHexValues()
        {
            byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
            string hex = BitConverter.ToString(bytes).Replace("-", "_");
            // I get the following: B8_D4_C7_F8_C9_AB_B4_F4
            // I am expecting:      38_54_47_78_49_2b_34_74
        }

Solution

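  • The only explanation that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding: each expected byte is the byte you got with its most significant bit cleared (for example, 0xB8 & 0x7F = 0x38).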

    Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.

    The Wikipedia entry on HZ has some interesting parts:

    The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.

    the HZ code uses only printable, 7-bit characters to represent Chinese characters.

    According to the Microsoft reference page for EncodingInfo.GetEncoding, this character encoding is supported in .NET:

    52936 hz-gb-2312 Chinese Simplified (HZ)
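
    One caveat, as far as I know: on .NET Core and .NET 5 or later only a handful of encodings are available by default, and legacy code pages such as this one have to be registered first through CodePagesEncodingProvider (shipped in the System.Text.Encoding.CodePages package on older targets). A minimal sketch, assuming such a runtime; LegacyEncodings and GetHz are hypothetical names used only for illustration:

```csharp
using System;
using System.Text;

static class LegacyEncodings
{
    // Hypothetical helper: returns the HZ encoding (code page 52936).
    public static Encoding GetHz()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // are not registered by default. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(52936);
    }
}
```

    On the classic .NET Framework the registration step should not be needed, since Encoding.GetEncoding resolves these code pages directly.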

    If I run your code with the character encoding switched to HZ:

    static void Main(string[] args)
    {
        const string SimplifiedChineseChars = "冈区色呆";
        byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
        string hex = BitConverter.ToString(bytes).Replace("-", "_");
        Console.WriteLine(hex);
    }
    

    Output:

    7E_7B_38_54_47_78_49_2B_34_74_7E_7D

    So you get exactly the bytes you are looking for, except that the encoder adds the escape sequences ~{ and ~} before and after the Chinese character bytes. Those escape sequences are necessary because HZ supports mixing ASCII bytes (single-byte encoding) with GB Chinese character bytes (double-byte encoding). They mark the regions that should not be interpreted as ASCII.

    If you choose the hz-gb-2312 encoding and your printer does not expect the escape sequences, you will have to strip them yourself. But perhaps the printer does need them; check exactly what it expects.
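
    If stripping turns out to be what you need, it could look roughly like the sketch below. It assumes the text is entirely Chinese, so the encoder output is a single ~{ ... ~} run; HzHelper and StripHzEscapes are hypothetical names, not part of .NET. Mixed ASCII and Chinese input would need a proper state machine instead.

```csharp
using System;

static class HzHelper
{
    // Hypothetical helper: returns the double-byte payload between a leading
    // "~{" (7E 7B) and a trailing "~}" (7E 7D).
    public static byte[] StripHzEscapes(byte[] hz)
    {
        bool wrapped = hz.Length >= 4
            && hz[0] == 0x7E && hz[1] == 0x7B
            && hz[hz.Length - 2] == 0x7E && hz[hz.Length - 1] == 0x7D;
        if (!wrapped)
            throw new ArgumentException("Expected a single ~{ ... ~} run.", nameof(hz));

        byte[] payload = new byte[hz.Length - 4];
        Array.Copy(hz, 2, payload, 0, payload.Length);

        // The payload must be whole byte pairs. A pair starting with 0x7E is an
        // escape sequence, which means the input was not one uninterrupted run.
        if (payload.Length % 2 != 0)
            throw new ArgumentException("Payload is not made of byte pairs.", nameof(hz));
        for (int i = 0; i < payload.Length; i += 2)
        {
            if (payload[i] == 0x7E)
                throw new ArgumentException("Input contains more than one run.", nameof(hz));
        }
        return payload;
    }
}
```

    Passing the 12 bytes from the output above (7E_7B_..._7E_7D) leaves the 8 bytes in the middle untouched and drops only the two escape sequences.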

    Alternatively, if you really don't want the escape sequences, don't need to handle ASCII characters, and are confident that you only have to deal with Chinese double-byte characters, you could stick with the vanilla GB2312 encoding and drop the most significant bit of every byte yourself, which essentially converts the result to a 7-bit encoding.

    Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.

    using System;
    using System.Linq; // needed for Select and ToArray
    using System.Text;

    static void Main(string[] args)
    {
        const string SimplifiedChineseChars = "冈区色呆";
        byte[] bytes = Encoding.GetEncoding("gb2312") // vanilla gb2312 encoding
                .GetBytes(SimplifiedChineseChars)
                .Select(b => (byte)(b & 0x7F)) // retain 7 bits only
                .ToArray();
        string hex = BitConverter.ToString(bytes).Replace("-", "_");
        Console.WriteLine(hex);
    }

    Output:

    38_54_47_78_49_2B_34_74
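
    Because masking silently turns any ASCII byte into something indistinguishable from half of a Chinese character, a slightly more defensive variant may be worth considering. The sketch below is only an illustration under the assumption that the input should contain nothing but GB2312 double-byte characters; Gb2312SevenBit and ToSevenBit are hypothetical names.

```csharp
using System;
using System.Linq;

static class Gb2312SevenBit
{
    // Hypothetical helper: clears the high bit of every byte, but refuses input
    // containing a byte below 0x80 (for example ASCII), because the masked
    // result would be ambiguous.
    public static byte[] ToSevenBit(byte[] gbBytes)
    {
        if (gbBytes.Any(b => b < 0x80))
            throw new ArgumentException("Input contains a byte without the high bit set.", nameof(gbBytes));

        return gbBytes.Select(b => (byte)(b & 0x7F)).ToArray();
    }
}
```

    Fed the bytes from the question (B8_D4_C7_F8_C9_AB_B4_F4), this performs the same masking as the code above; the only difference is that it throws instead of producing misleading output when a single-byte character slips in.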