FileHelpers: ReadStream skips two records on each iteration


I have a file that looks like this:

ID  TEXT
1   XXXX
2   XXXX
3   XXXX
4   XXXX
5   XXXX
6   XXXX
7   XXXX
8   XXXX
9   XXXX
10  XXXX

And the FileHelpers record class is defined like this:

[DelimitedRecord("\t")]
public class TestItem
{
    public int Id;
    public string Text;
}

I read the file with the following code:

FileHelperEngine<TestItem> eng = new FileHelperEngine<TestItem>();
using (var file = new FileStream("FILEPATH", FileMode.Open, FileAccess.Read))
{
    // Declared with leaveOpen = true because FileHelpers closes the reader after each ReadStream call
    StreamReader reader = new StreamReader(file, Encoding.UTF8, false, 1024, true);
    eng.Options.IgnoreFirstLines = 1;

    TestItem[] content = null;
    bool headerRead = false;
    do
    {
        content = eng.ReadStream(reader, 2);
        if (!headerRead)
        {
            headerRead = true;
            eng.Options.IgnoreFirstLines = 0;
        }
    }
    while (!reader.EndOfStream);
}

As you can see, I read 2 records at a time and ignore the first line on the first iteration. On the second iteration I expect to get records 3 and 4, but instead I get records 5 and 6. Why does this happen, and how can I solve it?


Solution

  • The problem is how you are calling ReadStream. It is designed to read up to the maximum number of records from a stream in a single call, after which the stream is closed. As such, it creates a new ForwardReader on every call.

    ForwardReader.ReadNextLine() works by returning the line it already holds and then reading the following line into its cache, ready for the next call. So what happens is this:

    1. On the first call to ReadStream, the ForwardReader reads the first line into its cache as soon as it is created.
    2. currentLine is set via ForwardReader.ReadNextLine(), which returns the header record.
    3. IgnoreFirstLines is set, so ReadNextLine is called again; it returns the line previously read and then updates its cache with the next line. currentLine thus becomes row 2 (ID 1), since row 1 was skipped.
    4. We loop through the lines, adding currentLine to the array before updating it via ForwardReader.ReadNextLine().
    5. Once the maximum number of records is reached, we exit. At this point, currentLine is row 4 (ID 3), and the cache already holds row 5 (ID 4).

    So if you made only the one call, everything would work as expected. However, because you make another call, this happens:

    1. The StreamReader is already positioned at the beginning of row 6, since everything before it was read above.
    2. Calling ReadStream creates a brand-new ForwardReader, which in turn automatically reads the next line into its cache; it therefore holds row 6 (ID 5).
    3. The maximum number of records is two, so we also get row 7 (ID 6).
    4. At this point, row 8 is in currentLine, row 9 is in the cache, and the StreamReader is waiting at row 10.
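
    The two walkthroughs above can be reproduced with a small model. This is only a sketch: LookaheadReader and ReadBatch are hypothetical stand-ins written for illustration, not the FileHelpers source, and the sample file is held in a StringReader with an assumed tab delimiter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Simplified stand-in for a look-ahead line reader (NOT the FileHelpers source):
// it always holds one line that was pulled from the stream but not yet returned.
sealed class LookaheadReader
{
    private readonly TextReader _reader;
    private string _cached;

    public LookaheadReader(TextReader reader)
    {
        _reader = reader;
        _cached = reader.ReadLine();   // creating the wrapper already consumes one line
    }

    // Returns the cached line, then refills the cache from the stream.
    public string ReadNextLine()
    {
        string result = _cached;
        _cached = _reader.ReadLine();
        return result;
    }
}

static class LookaheadDemo
{
    // Hypothetical model of a single ReadStream(reader, maxRecords) call.
    public static List<string> ReadBatch(TextReader reader, int maxRecords, int ignoreFirstLines)
    {
        var forward = new LookaheadReader(reader);   // a fresh wrapper on every call
        string current = forward.ReadNextLine();
        for (int i = 0; i < ignoreFirstLines && current != null; i++)
            current = forward.ReadNextLine();

        var batch = new List<string>();
        while (current != null && batch.Count < maxRecords)
        {
            batch.Add(current);
            current = forward.ReadNextLine();
        }
        return batch;   // 'current' and the cached line are thrown away here
    }

    // Replays the do/while loop from the question over an in-memory copy of the file.
    public static List<string> Run()
    {
        var lines = new List<string> { "ID\tTEXT" };
        for (int id = 1; id <= 10; id++) lines.Add(id + "\tXXXX");
        var reader = new StringReader(string.Join("\n", lines));

        var batches = new List<string>();
        int ignore = 1;
        do
        {
            List<string> batch = ReadBatch(reader, 2, ignore);
            batches.Add(string.Join(",", batch.ConvertAll(l => l.Split('\t')[0])));
            ignore = 0;
        }
        while (reader.Peek() != -1);
        return batches;
    }

    static void Main()
    {
        foreach (string ids in Run()) Console.WriteLine(ids);
    }
}
```

    Under this model the program prints 1,2 then 5,6 then 9,10: every call discards the two lines it has read ahead, which matches the records 5 and 6 reported in the question.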

    In some regards this is a bug, but in others it's your usage of ReadStream that isn't really correct. A better way would be to drop the maximum number of records to read and instead use the INotifyRead/INotifyWrite functionality if you need to process on a per-record basis.
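
    If the goal is specifically to consume the file two records at a time, another route (not the INotifyRead approach mentioned above) is FileHelperAsyncEngine, which keeps a single reader open between reads. The following is a minimal sketch, assuming the FileHelpers package is referenced and the file is tab-delimited; BatchReader and ReadInBatches are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using FileHelpers;

[DelimitedRecord("\t")]
public class TestItem
{
    public int Id;
    public string Text;
}

public static class BatchReader
{
    // Yields records in groups of 'batchSize' from ONE engine instance,
    // so no read-ahead state is lost between batches.
    public static IEnumerable<TestItem[]> ReadInBatches(TextReader reader, int batchSize)
    {
        var engine = new FileHelperAsyncEngine<TestItem>();
        engine.Options.IgnoreFirstLines = 1;   // skip the header line once

        engine.BeginReadStream(reader);
        try
        {
            TestItem[] batch;
            // ReadNexts returns fewer records (eventually none) once the input runs out.
            while ((batch = engine.ReadNexts(batchSize)).Length > 0)
                yield return batch;
        }
        finally
        {
            engine.Close();
        }
    }
}
```

    A caller would wrap the file in a StreamReader and iterate with foreach (var pair in BatchReader.ReadInBatches(reader, 2)). Because the engine is opened once, the second batch should start at record 3 rather than record 5.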