c# .net character-encoding streamreader ebcdic

Strange behaviour with StreamReader and EBCDIC: Why?

Background: I have to write an application that reads a poorly designed EBCDIC file. The file contains binary data and uses ASCII line terminators, and sometimes the binary data happens to contain an ASCII CRLF, which causes a record to be split incorrectly. I need to take this old file format and drop the CRLF at the end of each record.

It seems that using a StreamReader with IBM037 encoding causes the ReadLine() method to only read \r as an end of line instead of \r\n as I'm expecting, so every string (after the first) I get back from ReadLine starts with a LF (0A in ASCII).

Sample program that reproduces the problem:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

class Program
{
  static void Main(string[] args)
  {
    //generate example EBCDIC data
    List<byte> bytes = new List<byte>();
    Encoding EBCDIC = Encoding.GetEncoding("IBM037");
    bytes.AddRange(Encoding.Convert(Encoding.ASCII, EBCDIC, Encoding.ASCII.GetBytes("Some nice ascii text")));
    bytes.AddRange(new byte[] { (byte)'\r', (byte)'\n' });
    bytes.AddRange(Encoding.Convert(Encoding.ASCII, EBCDIC, Encoding.ASCII.GetBytes("Some more nice ascii text")));

    //read it using StreamReader
    using(MemoryStream ms = new MemoryStream(bytes.ToArray()))
    using (StreamReader reader = new StreamReader(ms, EBCDIC))
    {
      string line = string.Empty;
      while ((line = reader.ReadLine()) != null)
      {
        EBCDIC.GetBytes(line).ToList().ForEach(c => Console.Write(c));
        Console.WriteLine();
      }
    }
    Console.ReadLine();
  }
}

The output is as follows:

226150148133641491371311336412916213113713764163133167163
1022615014813364148150153133641491371311336412916213113713764163133167163

That 10 at the beginning of the second line should not be there, since that is the LF from the CRLF sequence.

My understanding of the ReadLine method was that:

A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n"). The string that is returned does not contain the terminating carriage return or line feed. Source

It doesn't say anything about encodings changing that, so according to that it should read the full CRLF in my data and not just the CR.

Update: I have already worked around this problem and implemented my own method of reading the data, but my question is still as follows: Why did ReadLine not do what it says on the tin?

Solution

You stuff a (byte)'\r' and (byte)'\n' into a stream that you tell the StreamReader is encoded in EBCDIC.

The value for (byte) '\r' is 0x0d, which happens to be a carriage return in both ASCII and in EBCDIC.

The value for (byte) '\n' is 0x0a, which is a line feed in ASCII, but is not a line feed in EBCDIC.

If you look at how the IBM037 Encoding decodes the byte 0x0a into a .NET Unicode char, you will find that the numeric value of the resulting char is 142 (or 0x8e). And that character is not a line feed. (I don't know why it's decoded into 142.)

You see "10" printed out at the start of the second line not because there's a line feed there, but because the char with value 142 is being re-encoded back to an EBCDIC byte with the value 10 (in the sub-expression EBCDIC.GetBytes(line)).
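The round trip described above can be checked in isolation. The following is a minimal sketch, not part of the original program: the class EbcdicProbe and its helpers are hypothetical names, and the RegisterProvider call is an assumption that applies to .NET Core / .NET 5+, where code page encodings are not available by default (on the classic .NET Framework that line can probably be dropped). It decodes single bytes as IBM037 and encodes single chars back. As far as I know, code page 037 puts the line feed at 0x25 rather than 0x0a, which the last line lets you confirm.

```csharp
using System;
using System.Text;

public static class EbcdicProbe
{
    // Hypothetical helper: returns the IBM037 encoding. Registering the
    // code pages provider is assumed to be needed on .NET Core / .NET 5+.
    public static Encoding GetEbcdic()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("IBM037");
    }

    // Decodes one byte as IBM037 and returns the numeric value of the char.
    public static int Decode(byte b)
    {
        return GetEbcdic().GetString(new[] { b })[0];
    }

    // Encodes one char as IBM037 and returns the resulting byte.
    public static byte Encode(char c)
    {
        return GetEbcdic().GetBytes(new[] { c })[0];
    }

    public static void Main()
    {
        // Carriage return: same byte value in ASCII and in IBM037.
        Console.WriteLine("0x0D decodes to U+{0:X4}", Decode(0x0D));
        // ASCII line feed byte: decodes to char 142, not to '\n'.
        Console.WriteLine("0x0A decodes to U+{0:X4}", Decode(0x0A));
        // Char 142 re-encodes to byte 10, which is the "10" in the output.
        Console.WriteLine("U+008E encodes to 0x{0:X2}", Encode('\u008E'));
        // The byte IBM037 actually uses for a line feed.
        Console.WriteLine("'\\n' encodes to 0x{0:X2}", Encode('\n'));
    }
}
```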

So to answer your question quite simply, ReadLine() only sees a carriage return, not a carriage return followed by a line feed.

Change your while loop to look like the following:

while ((line = reader.ReadLine()) != null)
{
    line.ToList().ForEach(c => { Console.Write(c); Console.Write(" "); });
    Console.WriteLine();
    line.ToList().ForEach(c => { Console.Write(Convert.ToInt32(c)); Console.Write(" "); });
    Console.WriteLine();
    EBCDIC.GetBytes(line).ToList().ForEach(c => { Console.Write(c); Console.Write(" "); });
    Console.WriteLine();
    Console.WriteLine();
    Console.WriteLine();
}

and you'll get the following output for your second line, which displays the line (converted from EBCDIC) as characters, the Unicode values for those characters, and finally the values of those characters converted back to EBCDIC:

? S o m e   m o r e   n i c e   a s c i i   t e x t
142 83 111 109 101 32 109 111 114 101 32 110 105 99 101 32 97 115 99 105 105 32 116 101 120 116
10 226 150 148 133 64 148 150 153 133 64 149 137 131 133 64 129 162 131 137 137 64 163 133 167 163
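To see that ReadLine() itself behaves as documented, compare the same text with the terminator written two ways: once as raw ASCII bytes (as in the question) and once encoded through IBM037. This is only a sketch under assumptions: EbcdicLines, BuildSample and ReadLines are hypothetical names, and the RegisterProvider call is assumed to be needed only on .NET Core / .NET 5+. It does not fix the real file, which still contains ASCII terminators; it only illustrates that the reader needs a decoded '\n' to treat CRLF as one terminator.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class EbcdicLines
{
    public static Encoding GetEbcdic()
    {
        // Assumed to be required on .NET Core / .NET 5+ only.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("IBM037");
    }

    // Builds two EBCDIC records separated by CRLF. The terminator is either
    // encoded through IBM037 or appended as raw ASCII bytes 0x0D 0x0A.
    public static byte[] BuildSample(bool encodeTerminator)
    {
        Encoding ebcdic = GetEbcdic();
        var bytes = new List<byte>();
        bytes.AddRange(ebcdic.GetBytes("Some nice ascii text"));
        bytes.AddRange(encodeTerminator
            ? ebcdic.GetBytes("\r\n")
            : new byte[] { 0x0D, 0x0A });
        bytes.AddRange(ebcdic.GetBytes("Some more nice ascii text"));
        return bytes.ToArray();
    }

    // Reads all lines from the data with a StreamReader using IBM037.
    public static List<string> ReadLines(byte[] data)
    {
        var lines = new List<string>();
        using (var ms = new MemoryStream(data))
        using (var reader = new StreamReader(ms, GetEbcdic()))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (bool encoded in new[] { false, true })
        {
            string second = ReadLines(BuildSample(encoded))[1];
            // Print the numeric value of the first char of the second line.
            Console.WriteLine("terminator encoded: {0}, second line starts with char {1}",
                encoded, (int)second[0]);
        }
    }
}
```

With the raw ASCII terminator the second line should start with char 142; with the IBM037-encoded terminator it should start with 83 ('S').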