UTF32 and C# problems


I'm having some trouble with character encoding. When I put the following two characters into a UTF-32 encoded text file:

𩸕
鸕

and then run this code on them:

System.IO.StreamReader streamReader = 
    new System.IO.StreamReader("input", System.Text.Encoding.UTF32, false);
System.IO.StreamWriter streamWriter = 
    new System.IO.StreamWriter("output", false, System.Text.Encoding.UTF32);
    
streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

I get:

鸕
鸕

(the same character twice, i.e. the output file != the input file)

A few things that might help: Hex for the first character:

15 9E 02 00

And for the second:

15 9E 00 00

I am using gedit for the text file creation, mono for the C# and I'm using Ubuntu.

It doesn't matter whether I specify the encoding for the input file or the output file; it just doesn't work when the file is in UTF-32 encoding. It works if the input file is in UTF-8 encoding.

The input file is as follows:

FF FE 00 00 15 9E 02 00 0A 00 00 00 15 9E 00 00 0A 00 00 00
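
To check the decoding separately from the file and the editor, the same bytes can be decoded in memory. This is only a minimal sketch: the class and method names (`Utf32DecodeCheck`, `CodeUnits`) are made up for illustration, and the 4-byte BOM is left out so that it doesn't show up in the decoded string. It prints the UTF-16 code units that the decoder produces.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of the original question.
public static class Utf32DecodeCheck
{
    // Decode little-endian UTF-32 bytes and list the resulting UTF-16
    // code units as space-separated hex values.
    public static string CodeUnits(byte[] utf32Bytes)
    {
        // bigEndian: false, byteOrderMark: false, throwOnInvalidCharacters: true
        var utf32 = new UTF32Encoding(false, false, true);
        string text = utf32.GetString(utf32Bytes);

        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // The input file shown above, minus the BOM (FF FE 00 00).
        byte[] payload =
        {
            0x15, 0x9E, 0x02, 0x00,   // U+29E15
            0x0A, 0x00, 0x00, 0x00,   // line feed
            0x15, 0x9E, 0x00, 0x00,   // U+9E15
            0x0A, 0x00, 0x00, 0x00    // line feed
        };

        Console.WriteLine(CodeUnits(payload));
    }
}
```

With a decoder that follows the Unicode rules, U+29E15 should come out as the surrogate pair D867 DE15, so the expected line is `D867 DE15 000A 9E15 000A`. If the first character shows up as a single `9E15` instead, the upper bits of the code point are being dropped somewhere in the decoder.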

Is it a bug, or is it just me?

Thanks!


Solution

  • OK, so I think I figured it out; it seems to work now. The first character, 15 9E 02 00, is code point U+29E15, which is above U+FFFF, so there's no way it can be held in one single UTF-16 char (the second one, U+9E15, does fit in one). For code points like the first, UTF-16 uses surrogate pairs, where two chars together act as one 'element'. To get the elements, we can use:

    StringInfo.GetTextElementEnumerator(string fred);
    

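    This returns a TextElementEnumerator; calling GetTextElement() on it gives each element as a string, with a surrogate pair kept together in the same string. Treat each of those strings as one character.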

    See here:

    http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx

    http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.gettextelement.aspx

    Hope it helps someone :D
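
For reference, here is a minimal sketch of walking a string element by element with the enumerator mentioned above. The class and method names (`TextElementDemo`, `Describe`) are made up for illustration, and the test string is built from the two code points in the question with `char.ConvertFromUtf32`, so nothing depends on the file or the editor.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, not part of the original answer.
public static class TextElementDemo
{
    // Return one description per text element: its code point and how many
    // UTF-16 chars it occupies in the string.
    public static List<string> Describe(string text)
    {
        var result = new List<string>();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            // Code point of the first character in the element
            // (a surrogate pair is combined into one value).
            int codePoint = char.ConvertToUtf32(element, 0);
            result.Add(string.Format("U+{0:X4} ({1} char(s))", codePoint, element.Length));
        }
        return result;
    }

    public static void Main()
    {
        // U+29E15 needs a surrogate pair; U+9E15 fits in a single char.
        string text = char.ConvertFromUtf32(0x29E15) + "\n" + char.ConvertFromUtf32(0x9E15);

        // Length counts UTF-16 chars, not characters as a reader sees them.
        Console.WriteLine(text.Length);

        foreach (string line in Describe(text))
            Console.WriteLine(line);
    }
}
```

Here `text.Length` is 4 even though there are only three elements (two characters and a line feed), which is why indexing the string char by char splits the first character in half.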