I'm using a finite-state machine to read an extremely large file. It's not multi-threaded, so thread safety is not a concern.
It contains 3 kinds of content:
I found this question, which looked useful, but the approach failed for me. The similar Python question is not useful either, because it won't throw any error. I have to read the content with the proper encoding, or the behavior becomes unpredictable.
Currently, I'm using StreamReader, but the CurrentEncoding property cannot be changed once the StreamReader is initialized.
So I've also tried to recreate the StreamReader on the same Stream:
reader = new StreamReader(stream, encoding65001); // UTF-8 (code page 65001)
DoSomething(reader);
reader = new StreamReader(stream, encoding1252); // ANSI, Western European (code page 1252)
DoSomething(reader);
reader = new StreamReader(stream, encoding936); // ANSI, Simplified Chinese GBK (code page 936)
//...
But it starts to read strange content from an unknown position. I haven't found the cause of this behavior.
Have I made a mistake in creating multiple StreamReaders, or is StreamReader designed so that only one should be created on a given stream?
If it is designed that way, is there any solution for reading such a file?
Thank you for taking the time to read this.
Edit: I've run the following code on .NET Core 3.1:
Stream stream = File.OpenRead(testFilePath);
Console.WriteLine(stream.Position);
Console.WriteLine(stream.ReadByte());
Console.WriteLine(stream.Position + "\r\n");
StreamReader reader = new StreamReader(stream, Encoding.UTF8);
Console.WriteLine(reader.Read());
Console.WriteLine(stream.Position + "\r\n");
reader = new StreamReader(stream, CodePagesEncodingProvider.Instance.GetEncoding(1252));
Console.WriteLine(reader.Read());
Console.WriteLine(stream.Position);
With the following example text:
abcdefg
And the output:
0
97
1
98
7
-1
7
It's strange and interesting.
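The stream readers buffer the content of the underlying stream they're reading, which is what's causing your problems. Just because you read one character from your reader doesn't mean it reads just one character's worth of bytes from the underlying stream. It fills a whole buffer with bytes, and then yields you one character from that buffer. In your test, the first reader pulled all remaining bytes into its buffer (the stream position jumped to 7), so the second reader found nothing left to read and returned -1.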
If you want to read from a stream and interpret different sections of bytes with different encodings, you'll have to pull the bytes out of the stream yourself and then decode them with the appropriate encoding, so that you consume exactly the bytes you want and no more. (For the record, if at all possible, you should avoid putting yourself in the position of having mixed encodings in your data.)
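As a rough illustration of pulling the bytes out yourself, here is a minimal sketch. It assumes you already know how many bytes each section occupies (for example from a length prefix, or because your state machine located the section boundary at the byte level); the question doesn't say how sections are delimited, so that part is an assumption. The names `MixedEncodingReader`, `ReadSection`, and `Demo` are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class MixedEncodingReader
{
    // Hypothetical helper: reads exactly byteCount bytes from the stream and
    // decodes them with the given encoding. Nothing is read ahead, so
    // stream.Position ends up exactly at the end of the section.
    public static string ReadSection(Stream stream, int byteCount, Encoding encoding)
    {
        byte[] buffer = new byte[byteCount];
        int total = 0;
        while (total < byteCount)
        {
            // Stream.Read may return fewer bytes than requested, so keep looping.
            int read = stream.Read(buffer, total, byteCount - total);
            if (read == 0)
                throw new EndOfStreamException("Section is shorter than expected.");
            total += read;
        }
        return encoding.GetString(buffer, 0, total);
    }

    // Builds an in-memory "file" with a UTF-8 section followed by a
    // code page 1252 section, then reads both back from the same stream.
    public static (string First, string Second, long Position) Demo()
    {
        Encoding utf8 = Encoding.UTF8;
        Encoding cp1252 = CodePagesEncodingProvider.Instance.GetEncoding(1252);

        byte[] firstBytes = utf8.GetBytes("h\u00E9llo");   // 6 bytes: U+00E9 takes two bytes in UTF-8
        byte[] secondBytes = cp1252.GetBytes("caf\u00E9"); // 4 bytes: U+00E9 takes one byte in 1252

        using (var stream = new MemoryStream())
        {
            stream.Write(firstBytes, 0, firstBytes.Length);
            stream.Write(secondBytes, 0, secondBytes.Length);
            stream.Position = 0;

            string first = ReadSection(stream, firstBytes.Length, utf8);
            string second = ReadSection(stream, secondBytes.Length, cp1252);
            return (first, second, stream.Position);
        }
    }
}
```

Calling `MixedEncodingReader.Demo()` from your own `Main` should give back both strings intact, with the stream positioned right after the second section. If your sections are terminated by a delimiter rather than having a known length, the same idea applies, but you would scan the raw bytes for the delimiter before decoding instead of passing a byte count.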