c# encoding utf-8 streamreader decoding

Convert a String, which is already malformed


I have a class that uses another class to read a text file. The text file is written in ASCII or, to be precise, CP1252.

Background info: The text file is generated in Axapta with the ASCIIio class, which writes the text using the writeRaw method.

The class I am using was written by a colleague, and it uses a C# StreamReader to read files. Normally this works fine because the files are written in UTF-8, but in this particular case the file isn't.

So the StreamReader reads the file as UTF-8 and passes the resulting string to me. Some letters, for example the Latin small letter o with diaeresis (ö), do not come through the way I need them.

A simple conversion of the string doesn't help in this case, and I can't figure out how to get the right letters back.

So this is basically how he reads it:

char quotationChar = '"';
String line = "";
using (StreamReader reader = new StreamReader(fileName))
{
    if((line = reader.ReadLine()) != null)
    {
        line = line.Replace(quotationChar.ToString(), "");
    }
}
return line;

What happens now is this: the text file contains the German word "Röhre", which turns into R�hre after being read with the StreamReader (and that looks stupid in a database).
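To see where the � comes from, here is a minimal sketch that decodes the CP1252 bytes of "Röhre" as UTF-8, which is what `new StreamReader(fileName)` does by default. The byte values are written out by hand (ö is the single byte 0xF6 in CP1252), and the variable names are illustrative only.

```csharp
using System;
using System.Text;

// "Röhre" as it is stored in the CP1252 file: ö is the single byte 0xF6.
byte[] fileBytes = { 0x52, 0xF6, 0x68, 0x72, 0x65 };

// 0xF6 can never appear in valid UTF-8, so the decoder substitutes U+FFFD for it.
string decoded = Encoding.UTF8.GetString(fileBytes);

// Print the code point of every character in the decoded string.
foreach (char c in decoded)
{
    Console.Write("U+{0:X4} ", (int)c);
}
Console.WriteLine();
```

The second character printed is U+FFFD, the Unicode replacement character; nothing in the string records that the byte was originally 0xF6.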

I could try to convert every letter:

Encoding enc = Encoding.GetEncoding(1252); 
byte[] utf8_Bytes = new byte[line.Length];
for (int i = 0; i < line.Length; ++i)
{
    utf8_Bytes[i] = (byte)line[i];
}
String propEncodeString = enc.GetString(utf8_Bytes, 0, utf8_Bytes.Length);

That doesn't give me the right character!

byte[] myarr = Encoding.UTF8.GetBytes(line);
String propEncodeString = enc.GetString(myarr);

That also returns the wrong character.
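Both attempts start from a string in which ö has already been replaced by U+FFFD, which is why neither can produce the right letter. The sketch below replays them on that string. As an assumption for portability, it uses ISO-8859-1 as a stand-in for CP1252: the two agree on the bytes involved here, and ISO-8859-1 is available without registering an extra code page provider on newer .NET versions.

```csharp
using System;
using System.Text;

// Assumption: Latin-1 stands in for CP1252; both map 0xA0-0xFF identically.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

// What the StreamReader handed over: ö is already U+FFFD.
string line = "R\uFFFDhre";

// Attempt 1: casting char to byte (default unchecked context) keeps only
// the low 8 bits, so 0xFFFD becomes 0xFD.
byte[] truncated = new byte[line.Length];
for (int i = 0; i < line.Length; ++i)
{
    truncated[i] = (byte)line[i];
}
string attempt1 = latin1.GetString(truncated);   // 0xFD decodes to 'ý'

// Attempt 2: U+FFFD encodes to the UTF-8 bytes EF BF BD,
// which decode as the three characters "ï¿½".
string attempt2 = latin1.GetString(Encoding.UTF8.GetBytes(line));

Console.WriteLine(attempt1);
Console.WriteLine(attempt2);
```

In both cases the input to the conversion is the replacement character, not the original byte, so the output can only be some re-rendering of U+FFFD.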

I am aware that I could just solve the problem by using this:

using (StreamReader reader = new StreamReader(fileName, Encoding.Default, true))
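A minimal sketch of that fix with the encoding named explicitly follows. `Encoding.Default` is the system ANSI code page on .NET Framework but UTF-8 on .NET Core and later, so spelling out the encoding is more predictable. The temporary file and its contents are hypothetical test data. ISO-8859-1 is used here as a stand-in for CP1252 (an assumption; ö is 0xF6 in both); for the real file, `Encoding.GetEncoding(1252)` would be the natural choice, which on .NET Core and later requires registering the code pages encoding provider first.

```csharp
using System;
using System.IO;
using System.Text;

// Assumption: Latin-1 stands in for CP1252 here (ö is 0xF6 in both).
Encoding fileEncoding = Encoding.GetEncoding("iso-8859-1");

// Hypothetical test file containing "Röhre" in quotes, as raw single-byte text.
string fileName = Path.GetTempFileName();
File.WriteAllBytes(fileName, new byte[] { 0x22, 0x52, 0xF6, 0x68, 0x72, 0x65, 0x22 });

string line;
using (StreamReader reader = new StreamReader(fileName, fileEncoding))
{
    // ReadLine returns null on an empty file; fall back to an empty string.
    line = (reader.ReadLine() ?? "").Replace("\"", "");
}
File.Delete(fileName);

// Compare against "Röhre" written with an escape to avoid source-encoding issues.
Console.WriteLine(line == "R\u00F6hre");
```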

But just for fun: how can I get the right string back from a string that has already been decoded incorrectly?


Solution

  • Once the file's bytes have been decoded as UTF-8, every byte that doesn't form a valid UTF-8 sequence is replaced with the same replacement character (U+FFFD, displayed as �). The original byte is lost at that point, so you can't simply 'convert' back to the correct character downstream. See this example: https://dotnetfiddle.net/XWysml
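The loss can be shown in a few lines. This sketch (independent of the linked fiddle, with hand-written byte values) decodes two different CP1252 letters as UTF-8 and compares the results.

```csharp
using System;
using System.Text;

// Two different CP1252 letters: ö (0xF6) and ü (0xFC).
string fromO = Encoding.UTF8.GetString(new byte[] { 0xF6 });
string fromU = Encoding.UTF8.GetString(new byte[] { 0xFC });

// Both bytes are invalid in UTF-8, so both decode to the same U+FFFD.
Console.WriteLine(fromO == fromU);
Console.WriteLine(fromO == "\uFFFD");
```

Because distinct input bytes end up as the same character, no function of the decoded string alone can tell them apart; the only reliable fix is to read the original bytes again with the correct encoding.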