Decoding gb18030 to UTF8 in C#


I have a text file whose contents, when opened in Notepad, show as:

ʸ³ßÓÀ¼ª

If I drag it into the Chrome browser, it is automatically decoded and displayed correctly as

矢尺永吉

After a bit of research, I found that the file is encoded with GB18030. I am attempting to do the conversion in C#. Below is my code:

public static string codeCovert(string s)
    {
        Encoding gb18 = Encoding.GetEncoding("gb18030");
        Encoding Utf8 = Encoding.UTF8;

        byte[] gbcode = gb18.GetBytes(s);

        return Utf8.GetString(gbcode);      
    }

And this still gives a whole bunch of wrong characters. Can anyone help please? Thanks.


Solution

  • Your method takes in a string and returns another string, which does not make sense: an encoding maps between bytes and text, not between one string and another. System.String is a "vector" of UTF-16 code units.

    You should do:

    using System.Text;
    using System.IO;
    
    // ...
    
      var str = File.ReadAllText(@"path\file.txt", Encoding.GetEncoding("GB18030"));
    

    While str is in memory, it has the value "矢尺永吉". It cannot be "UTF-8" when it is a .NET string in memory. You can save it to another file, of course:

      File.WriteAllText(@"path\otherfile.txt", str, Encoding.UTF8);
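
    If the data is already in memory as raw bytes rather than in a file, the same idea applies: decode the bytes to a string once, then encode that string as UTF-8. Below is a minimal sketch of that, assuming .NET Core or .NET 5+ (hence the provider registration). The class name Gb18030Demo and the helper DecodeGb18030 are hypothetical names made up for illustration, and the byte values are the GB18030 encoding of the four characters from the question.

```csharp
using System;
using System.Text;

public static class Gb18030Demo
{
    // Hypothetical helper: turn raw GB18030 bytes into a .NET (UTF-16) string.
    public static string DecodeGb18030(byte[] raw)
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page
        // encodings must be registered first. Calling this repeatedly is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB18030").GetString(raw);
    }

    public static void Main()
    {
        // The bytes of "矢尺永吉" as stored in the GB18030 file.
        byte[] raw = { 0xCA, 0xB8, 0xB3, 0xDF, 0xD3, 0xC0, 0xBC, 0xAA };

        // Bytes -> string: this is the only place GB18030 is involved.
        string text = DecodeGb18030(raw);

        // String -> bytes: this is the only place UTF-8 is involved.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        Console.WriteLine(text.Length); // 4 UTF-16 code units
        Console.WriteLine(utf8.Length); // 12 bytes, 3 per character
    }
}
```

    Note that the original codeCovert went in the opposite direction (string to GB18030 bytes, then those bytes read as UTF-8), which is why it produced wrong characters.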
    

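    Edit: In newer versions of .NET (.NET Core and .NET 5 or later), code-page encodings such as GB18030 are not registered by default, so you need to do: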

    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    

    before you can use Encoding.GetEncoding("GB18030").
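
    Putting the pieces together, a file-to-file conversion might look like the sketch below. It assumes .NET Core or .NET 5+; the class name Gb18030FileConverter and the helper ConvertToUtf8 are hypothetical, and Main uses temporary files only so that the example runs on its own.

```csharp
using System;
using System.IO;
using System.Text;

public static class Gb18030FileConverter
{
    // Hypothetical helper: read a GB18030 text file and rewrite it as UTF-8.
    public static void ConvertToUtf8(string sourcePath, string targetPath)
    {
        // Must run before Encoding.GetEncoding("GB18030") on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb18030 = Encoding.GetEncoding("GB18030");

        // Decode the file's bytes into a .NET string.
        string text = File.ReadAllText(sourcePath, gb18030);

        // Encode the string as UTF-8 without a byte order mark.
        // Passing Encoding.UTF8 instead would prepend a 3-byte BOM.
        File.WriteAllText(targetPath, text, new UTF8Encoding(false));
    }

    public static void Main()
    {
        string source = Path.GetTempFileName();
        string target = Path.GetTempFileName();

        // Simulate the file from the question: "矢尺永吉" stored as GB18030.
        File.WriteAllBytes(source, new byte[] { 0xCA, 0xB8, 0xB3, 0xDF, 0xD3, 0xC0, 0xBC, 0xAA });

        ConvertToUtf8(source, target);

        Console.WriteLine(File.ReadAllBytes(target).Length); // 12 bytes of UTF-8

        File.Delete(source);
        File.Delete(target);
    }
}
```

    Whether to write a byte order mark is a design choice: some Windows tools use it to detect UTF-8, while many other tools expect UTF-8 files without one.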