c# string utf-8 utf-7

UTF7Encoding text truncation


I ran into an issue with the UTF7Encoding class dropping the '+4' sequence when decoding, and I would like to know why it happens. When I use UTF8Encoding to get the string from the byte[] array instead, it seems to work fine. Are there any known issues like this with UTF-8? I use the output of this conversion to construct HTML from an RTF string.

Here is the snippet:

    UTF7Encoding utf = new UTF7Encoding(); 
    UTF8Encoding utf8 = new UTF8Encoding(); 

    string test = "blah blah 9+4"; 

    char[] chars = test.ToCharArray(); 
    byte[] charBytes = new byte[chars.Length]; 

    for (int i = 0; i < chars.Length; i++)
    {
        charBytes[i] = (byte)chars[i];
    }

    string resultString = utf8.GetString(charBytes); 
    string resultStringWrong = utf.GetString(charBytes); 

    Console.WriteLine(resultString);       // blah blah 9+4
    Console.WriteLine(resultStringWrong);  // blah blah 9

Solution

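  • Casting each char to a byte like that does not produce UTF-7 data. In UTF-7, '+' is the shift character that starts a Base64-encoded run, so a literal plus sign has to be written as "+-". The raw bytes of "9+4" are therefore not valid UTF-7: the decoder treats "+4" as the start of an incomplete Base64 sequence and drops it. UTF-8 is unaffected because ASCII bytes decode to themselves there. If you want the string as charset-specific byte[], let the encoding produce the bytes: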

    UTF7Encoding utf = new UTF7Encoding();
    UTF8Encoding utf8 = new UTF8Encoding();
    
    string test = "blah blah 9+4";
    
    byte[] utfBytes = utf.GetBytes(test);
    byte[] utf8Bytes = utf8.GetBytes(test);
    
    string utfString = utf.GetString(utfBytes);
    string utf8String = utf8.GetString(utf8Bytes);
    
    Console.WriteLine(utfString);  
    Console.WriteLine(utf8String);
    

    Output:

    blah blah 9+4

    blah blah 9+4
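
To see why the hand-built byte array loses "+4", it may help to look at what the UTF-7 bytes actually contain. The following is a minimal sketch, not part of the original answer; the class name Utf7PlusDemo and its helper methods are made up for illustration. It prints the UTF-7 bytes as ASCII text and then decodes both the unescaped and the escaped form.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names here are illustrative only.
// Note: newer .NET versions flag UTF7Encoding as obsolete, so expect a compiler warning there.
public static class Utf7PlusDemo
{
    // Encode text as UTF-7 and return the resulting bytes viewed as ASCII text.
    public static string EncodeToAscii(string text)
    {
        byte[] bytes = new UTF7Encoding().GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    // Take plain ASCII bytes and decode them as if they were UTF-7,
    // which is effectively what the code in the question does.
    public static string DecodeAsciiAsUtf7(string asciiText)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(asciiText);
        return new UTF7Encoding().GetString(bytes);
    }

    public static void Main()
    {
        // The encoder escapes the literal plus sign as "+-".
        Console.WriteLine(EncodeToAscii("blah blah 9+4"));       // blah blah 9+-4

        // Unescaped '+' opens a Base64 run; "4" alone is incomplete and is dropped.
        Console.WriteLine(DecodeAsciiAsUtf7("blah blah 9+4"));   // blah blah 9

        // With the escape in place, the plus sign survives the round trip.
        Console.WriteLine(DecodeAsciiAsUtf7("blah blah 9+-4"));  // blah blah 9+4
    }
}
```

So the decoder is not misbehaving; the input bytes simply were not UTF-7. As long as the same encoding object produces the bytes and reads them back, the round trip should preserve the text.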