Why are StreamReader and sr.BaseStream.Seek() giving junk characters even with UTF-8 encoding?


The contents of abc.txt are:

ABCDEFGHIJ•XYZ

The character shown is correct if I use this code (i.e. seek to position 9):

    string filePath = "D:\\abc.txt";
    FileStream fs = new FileStream(filePath, FileMode.Open);
    StreamReader sr = new StreamReader(fs, new UTF8Encoding(true), true);
    sr.BaseStream.Seek(9, SeekOrigin.Begin);
    char[] oneChar = new char[1];
    // Read returns the number of characters read; the character itself lands in oneChar.
    int charsRead = sr.Read(oneChar, 0, 1);
    MessageBox.Show(oneChar[0].ToString());

But if I seek to the position just after that special dot character, i.e. position 11, I get a junk character:

    sr.BaseStream.Seek(11, SeekOrigin.Begin);

This should give 'X', because the character at position 11 is X.

I believe the file contents are valid UTF-8.

One more thing: the length of the StreamReader's BaseStream and the length of the StreamReader's contents are different.

    MessageBox.Show(sr.BaseStream.Length.ToString());
    MessageBox.Show(sr.ReadToEnd().Length.ToString());

Solution

  • Why are StreamReader and sr.BaseStream.Seek() giving junk characters even with UTF-8 encoding?

    It is exactly because of UTF-8 that sr.BaseStream is giving junk characters. :)

    StreamReader is the "smarter" of the two. It decodes bytes into characters using an encoding, so it understands how strings work, whereas FileStream (i.e. sr.BaseStream) doesn't. FileStream only knows about bytes.

    Since your file is encoded in UTF-8 (a variable-length encoding), letters like A, B and C are each encoded with 1 byte, but the • character needs 3 bytes. You can find out how many bytes a character needs by doing:

    Console.WriteLine(Encoding.UTF8.GetByteCount("•"));
    

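    Seek counts bytes, not characters. So when you move the stream to position 11, "the position just after •", you haven't actually moved past the •; you are on the second byte of it. Assuming the file has no byte order mark, the • occupies bytes 10 to 12 and X starts at byte 13.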
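
    To make the byte arithmetic concrete, here is a minimal sketch of seeking by character index rather than by byte. It assumes the text is held in a MemoryStream without a byte order mark instead of your D:\abc.txt, and the class and helper names (Utf8SeekSketch, ByteOffsetOfChar, ReadCharAt) are made up for illustration:

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class Utf8SeekSketch
    {
        // Hypothetical helper: number of UTF-8 bytes taken up by the
        // first charIndex characters of text.
        public static long ByteOffsetOfChar(string text, int charIndex)
        {
            return Encoding.UTF8.GetByteCount(text.Substring(0, charIndex));
        }

        // Hypothetical helper: seek to a byte offset, then decode one character.
        // A fresh reader is created after the seek, so no stale buffer is involved;
        // leaveOpen (the last argument) keeps the stream usable afterwards.
        public static char ReadCharAt(Stream stream, long byteOffset)
        {
            stream.Seek(byteOffset, SeekOrigin.Begin);
            using (var reader = new StreamReader(stream, new UTF8Encoding(false), false, 1024, true))
            {
                return (char)reader.Read();
            }
        }

        public static void Main()
        {
            // Assumption: same text as abc.txt, encoded without a byte order mark.
            string contents = "ABCDEFGHIJ\u2022XYZ";
            byte[] bytes = new UTF8Encoding(false).GetBytes(contents);

            using (var stream = new MemoryStream(bytes))
            {
                long offsetOfX = ByteOffsetOfChar(contents, 11);
                Console.WriteLine(offsetOfX);                     // 13 (10 letters + 3 bytes for the dot)
                Console.WriteLine(ReadCharAt(stream, offsetOfX)); // X
                // Byte 11 is in the middle of the dot, so this is not 'X'.
                Console.WriteLine(ReadCharAt(stream, 11) == 'X'); // False
            }
        }
    }
    ```

    If you keep a single StreamReader and seek its BaseStream after having read from it, you would probably also need to call sr.DiscardBufferedData() so that the reader does not serve characters from its old buffer. If the file starts with a UTF-8 byte order mark, every offset shifts by 3 more bytes.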

    The reason why the lengths are different is similar: sr.ReadToEnd().Length gives you the number of characters, whereas sr.BaseStream.Length gives you the number of bytes.
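
    The two lengths can be compared side by side with a small sketch. As an assumption, it again uses an in-memory copy of the text without a byte order mark rather than the real file, and the names (Utf8LengthSketch, ByteLength, CharLength) are hypothetical:

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class Utf8LengthSketch
    {
        // Length of the underlying stream: counts bytes.
        public static long ByteLength(string text)
        {
            using (var stream = new MemoryStream(new UTF8Encoding(false).GetBytes(text)))
            {
                return stream.Length;
            }
        }

        // Length of the decoded text: counts characters (UTF-16 code units).
        public static int CharLength(string text)
        {
            using (var stream = new MemoryStream(new UTF8Encoding(false).GetBytes(text)))
            using (var reader = new StreamReader(stream, new UTF8Encoding(false)))
            {
                return reader.ReadToEnd().Length;
            }
        }

        public static void Main()
        {
            string contents = "ABCDEFGHIJ\u2022XYZ";
            Console.WriteLine(ByteLength(contents)); // 16 (13 one-byte letters + 3 bytes for the dot)
            Console.WriteLine(CharLength(contents)); // 14
        }
    }
    ```

    On a real file that starts with a byte order mark, the byte count would be 3 higher still, while the character count stays the same.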