Seek through FileStream then using StreamReader to read from there


I want to seek to a point in a FileStream, then read forward from there using a StreamReader. Then I want to seek forward again and use a StreamReader to read another chunk of data.

const int BufferSize = 4096;
var buffer = new char[BufferSize];

var endpoints = new List<long>();

using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
{ 
    var fileLength = fileStream.Length;

    var seekPositionCount = fileLength / concurrentReads;

    long currentOffset = 0;
    for (var i = 0; i < concurrentReads; i++)
    {
        var seekPosition = seekPositionCount + currentOffset;

        // seek the file forward
        fileStream.Seek(seekPosition, SeekOrigin.Current);

        // setting true at the end is very important, keeps the underlying fileStream open.
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize, true))
        {
            // this also seeks the file forward the amount in the buffer...
            int bytesRead;
            var totalBytesRead = 0;
            while ((bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                totalBytesRead += bytesRead;

                var found = false;

                var gotR = false;

                for (var j = 0; j < buffer.Length; j++)
                {
                    if (buffer[j] == '\r')
                    {
                        gotR = true;
                        continue;
                    }

                    if (buffer[j] == '\n' && gotR)
                    {
                        // so we add the total bytes read, minus the current buffer amount read, then add how far into the buffer we actually read.
                        seekPosition += totalBytesRead - BufferSize + j;
                        endpoints.Add(seekPosition);
                        found = true;
                        break;
                    }
                }

                if (found) break;
            }
        }
        
        // we need to seek to the position we got to in the StreamReader (but not going by how much was read).
        fileStream.Seek(seekPosition, SeekOrigin.Current);

        currentOffset += seekPosition;
    }
}

return endpoints;

However, only two entries end up in endpoints before the loop exits.

(bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0

I thought the arguments passed to ReadAsync relate solely to the buffer, so the index argument says where in the buffer to start filling.

I can't make out from Reference Source how this value is used.
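A small check of what the index argument does may help here. This is only a sketch: the class and method names are made up for illustration, it uses a MemoryStream instead of a file, and it calls the synchronous Read, whose index and count parameters have the same meaning as in ReadAsync.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of the original question.
public static class ReadIndexDemo
{
    public static string FillAtIndex()
    {
        var source = new MemoryStream(Encoding.ASCII.GetBytes("abcdef"));
        var buffer = "--------".ToCharArray();

        using (var reader = new StreamReader(source, Encoding.UTF8, false, 4096))
        {
            // index = 2 is an offset into the char buffer, not into the stream;
            // count = 3 is the maximum number of chars to copy.
            reader.Read(buffer, 2, 3);
        }

        return new string(buffer); // "--abc---"
    }
}
```

So index only controls where in the buffer the chars land. It has no effect on where in the stream the read starts.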

I assumed (but can't find evidence to back it up) that a StreamReader uses the underlying Stream as its guide, so when you ask it to read, it starts from the position the underlying Stream is at.

But my results don't show that. They suggest the StreamReader starts at the beginning of the Stream each time, although I can't find evidence that it works that way either.
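The assumption above can be tested in isolation. The following is a minimal sketch, assuming an in-memory stream stands in for the file; ReaderPositionDemo and ReadOneCharAt are hypothetical names. It seeks first, creates a fresh StreamReader, reads one char, and reports where the base stream ended up.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of the original question.
public static class ReaderPositionDemo
{
    public static Tuple<char, long> ReadOneCharAt(long offset)
    {
        // 5000 'a' bytes followed by 5000 'b' bytes.
        var bytes = Encoding.ASCII.GetBytes(new string('a', 5000) + new string('b', 5000));

        using (var stream = new MemoryStream(bytes))
        {
            stream.Seek(offset, SeekOrigin.Begin);

            using (var reader = new StreamReader(stream, Encoding.UTF8, false, 4096, true))
            {
                // A new reader has an empty buffer, so it fills it from stream.Position.
                var first = (char)reader.Read();

                // The reader pulled a whole buffer, so Position has moved
                // further than the single char that was consumed.
                return Tuple.Create(first, stream.Position);
            }
        }
    }
}
```

Calling ReadOneCharAt(5000) returns 'b' as the first char, so a new reader does start at the stream's current position. The reported position is well past 5001, which is why the stream's Position after reading through a StreamReader cannot be used as "how far I have read".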

Seeking

Is my understanding of seeking correct? Suppose I call:

fileStream.Seek(seekPosition, SeekOrigin.Current);

If the file is at position 300 and I want to move to 600, should the seekPosition variable above be 600?

The Reference Source suggests otherwise:

else if (origin == SeekOrigin.Current) {
    // Don't call FlushRead here, which would have caused an infinite
    // loop.  Simply adjust the seek origin.  This isn't necessary
    // if we're seeking relative to the beginning or end of the stream.
    offset -= (_readLen - _readPos);
}
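The seek arithmetic asked about above can be checked with a short sketch. It assumes a 1000-byte MemoryStream in place of the file, and SeekOriginDemo is a hypothetical name. As far as I can tell, the adjustment in the quoted snippet only compensates for FileStream's own read buffer; the public behavior is still "offset relative to Position".

```csharp
using System;
using System.IO;

// Hypothetical demo class, not part of the original question.
public static class SeekOriginDemo
{
    public static long[] Run()
    {
        using (var stream = new MemoryStream(new byte[1000]))
        {
            // Current: the offset is added to the current position.
            stream.Position = 300;
            long relative = stream.Seek(300, SeekOrigin.Current);  // 600

            // Begin: the offset is the absolute target position.
            stream.Position = 300;
            long absolute = stream.Seek(600, SeekOrigin.Begin);    // 600

            // Passing the absolute target with Current overshoots.
            stream.Position = 300;
            long overshoot = stream.Seek(600, SeekOrigin.Current); // 900

            return new[] { relative, absolute, overshoot };
        }
    }
}
```

So to get from 300 to 600, pass 300 with SeekOrigin.Current or 600 with SeekOrigin.Begin.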

Solution

  • Thanks to Hans Passant, I have the answer:

    var buffer = new char[BufferSize];
    
    var endpoints = new List<long>();
    
    using (var fileStream = this.CreateMultipleReadAccessFileStream(fileName))
    {
        var fileLength = fileStream.Length;
    
        var seekPositionCount = fileLength / concurrentReads;
    
        long currentOffset = 0;
        for (var i = 0; i < concurrentReads; i++)
        {
            var seekPosition = seekPositionCount + currentOffset;
    
            // seek the file forward
            // fileStream.Seek(seekPosition, SeekOrigin.Current);
    
                // the helper is expected to pass leaveOpen: true to the StreamReader, which keeps the underlying fileStream open.
            using (var streamReader = this.CreateTemporaryStreamReader(fileStream))
            {
                // this is poor on performance, hence why you split the file here and read in new threads.
                streamReader.DiscardBufferedData();
                // you have to advance the fileStream here, because of the previous line
                streamReader.BaseStream.Seek(seekPosition, SeekOrigin.Begin);
                // this also seeks the file forward the amount in the buffer...
                int bytesRead;
                var totalBytesRead = 0;
                while ((bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0)
                {
                    totalBytesRead += bytesRead;
    
                    var found = false;
    
                    var gotR = false;
    
                    // only scan the chars this read actually returned; the rest of the buffer may be stale
                    for (var j = 0; j < bytesRead; j++)
                    {
                        if (buffer[j] == '\r')
                        {
                            gotR = true;
                            continue;
                        }
    
                        if (buffer[j] == '\n' && gotR)
                        {
                            // add the total read so far, minus the amount from the current read, plus how far into the buffer we got.
                            // note: ReadAsync counts chars, so this matches a byte offset only for single-byte (e.g. ASCII) content.
                            seekPosition += totalBytesRead - bytesRead + j;
                            endpoints.Add(seekPosition);
                            found = true;
                            break;
                        }
                        // if we have found new line then move the position to 
                    }
    
                    if (found) break;
                }
            }
    
            currentOffset = seekPosition;
        }
    }
    
    return endpoints;
    

    Note the new part. Rather than doing this twice:

    fileStream.Seek(seekPosition, SeekOrigin.Current);
    

    I now seek with SeekOrigin.Begin, going through the StreamReader's BaseStream:

    // this is poor on performance, hence why you split the file here and read in new threads.
    streamReader.DiscardBufferedData();
    // you have to advance the fileStream here, because of the previous line
    streamReader.BaseStream.Seek(seekPosition, SeekOrigin.Begin);
    

    DiscardBufferedData clears the reader's internal buffer, so the next read starts from the underlying stream's current position.
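    For comparison, here is a sketch of a different technique, not the code from this answer: the same boundary search done on raw bytes, which sidesteps StreamReader buffering and the char-versus-byte mismatch. It assumes UTF-8 or ASCII content, where the bytes 0x0D 0x0A can only ever be a real CR LF. LineBoundaryFinder and FindNextLineStart are hypothetical names.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original answer.
public static class LineBoundaryFinder
{
    // Returns the byte offset just past the first CR LF found at or after
    // `start`, or -1 if the stream ends before one is found.
    public static long FindNextLineStart(Stream stream, long start, int bufferSize = 4096)
    {
        var buffer = new byte[bufferSize];
        stream.Seek(start, SeekOrigin.Begin);

        long bufferStart = start;    // absolute offset of buffer[0]
        bool previousWasCr = false;  // carried across reads so a split CR LF is still seen
        int bytesRead;

        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (var i = 0; i < bytesRead; i++)
            {
                if (buffer[i] == (byte)'\n' && previousWasCr)
                {
                    return bufferStart + i + 1;
                }

                previousWasCr = buffer[i] == (byte)'\r';
            }

            bufferStart += bytesRead;
        }

        return -1;
    }
}
```

    Because everything here is counted in bytes, the returned value can be passed straight to Seek with SeekOrigin.Begin, and each chunk can then be handed to its own StreamReader.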