Tags: c#, performance, io, filestream, binaryreader

BinaryReader reading from a FileStream that loads in chunks


I'm reading values from a huge file (> 10 GB) using the following code:

FileStream fs = new FileStream(fileName, FileMode.Open);
BinaryReader br = new BinaryReader(fs);

int count = br.ReadInt32();
List<long> numbers = new List<long>(count);
for (int i = count; i > 0; i--)
{
    numbers.Add(br.ReadInt64());
}

Unfortunately, the read speed from my SSD is stuck at a few MB/s. I guess the limit is the IOPS of the SSD, so it might be better to read from the file in larger chunks.

Question

Does the FileStream in my code really read only 8 bytes from the file every time the BinaryReader calls ReadInt64()?

If so, is there a transparent way to provide the BinaryReader with a stream that reads in larger chunks from the file to speed up the procedure?

Test-Code

Here's a minimal example that creates a test file and measures the read performance.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

namespace TestWriteRead
{
    class Program
    {
        static void Main(string[] args)
        {
            System.IO.File.Delete("test");
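            // 1,000,000,000 values of 8 bytes each: roughly an 8 GB test file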
            CreateTestFile("test", 1000000000);

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            IEnumerable<long> test = Read("test");
            stopwatch.Stop();
            Console.WriteLine("File loaded within " + stopwatch.ElapsedMilliseconds + "ms");
        }

        private static void CreateTestFile(string filename, int count)
        {
            FileStream fs = new FileStream(filename, FileMode.CreateNew);
            BinaryWriter bw = new BinaryWriter(fs);

            bw.Write(count);
            for (int i = 0; i < count; i++)
            {
                long value = i;
                bw.Write(value);
            }

            fs.Close();
        }

        private static IEnumerable<long> Read(string filename)
        {
            FileStream fs = new FileStream(filename, FileMode.Open);
            BinaryReader br = new BinaryReader(fs);

            int count = br.ReadInt32();
            List<long> values = new List<long>(count);
            for (int i = 0; i < count; i++)
            {
                long value = br.ReadInt64();
                values.Add(value);
            }

            fs.Close();

            return values;
        }
    }
}

Solution

  • You should configure the stream to use SequentialScan to indicate that you will read the stream from start to finish. It should improve the speed significantly.

    Indicates that the file is to be accessed sequentially from beginning to end. The system can use this as a hint to optimize file caching. If an application moves the file pointer for random access, optimum caching may not occur; however, correct operation is still guaranteed.

    using (
        var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
            FileOptions.SequentialScan))
    {
        var br = new BinaryReader(fs);
        var count = br.ReadInt32();
        var numbers = new List<long>();
        for (int i = count; i > 0; i--)
        {
            numbers.Add(br.ReadInt64());
        }
    }
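
    A transparent variant (my own sketch, not part of the original answer): the BinaryReader keeps issuing 8-byte reads, but they are served from the FileStream's internal buffer, so passing a larger buffer size to the FileStream constructor (1 MB here, an arbitrary choice), or wrapping the stream in a BufferedStream, makes the actual disk reads happen in larger chunks:

    using (
        var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 20,
            FileOptions.SequentialScan))
    {
        var br = new BinaryReader(fs);
        var count = br.ReadInt32();
        var numbers = new List<long>(count);
        for (int i = 0; i < count; i++)
        {
            numbers.Add(br.ReadInt64());
        }
    }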
    

    Try reading in blocks instead:

    using (
    var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
    FileOptions.SequentialScan))
    {
        var br = new BinaryReader(fs);
        var numbersLeft = br.ReadInt32(); // the count was written as a 32-bit int
        byte[] buffer = new byte[8192];
        var bufferOffset = 0;
        var bytesLeftToReceive = sizeof(long) * (long)numbersLeft; // long: the byte count can exceed int.MaxValue
        var numbers = new List<long>(numbersLeft);
        while (true)
        {
            // Do not read more than possible
            var bytesToRead = (int)Math.Min(bytesLeftToReceive, buffer.Length - bufferOffset);
            if (bytesToRead == 0)
                break;
            var bytesRead = fs.Read(buffer, bufferOffset, bytesToRead);
            if (bytesRead == 0)
                break; //TODO: Continue to read if file is not ready?
    
            //move forward in read counter
            bytesLeftToReceive -= bytesRead;
            bytesRead += bufferOffset; //include bytes from previous read.
    
            //decide how many complete numbers we got
            var numbersToCrunch = bytesRead / sizeof(long);
    
            //crunch them
            for (int i = 0; i < numbersToCrunch; i++)
            {
                numbers.Add(BitConverter.ToInt64(buffer, i * sizeof(long)));
            }
    
            // move the last incomplete number to the beginning of the buffer.
            var remainder = bytesRead % sizeof(long);
            Buffer.BlockCopy(buffer, bytesRead - remainder, buffer, 0, remainder);
            bufferOffset = remainder;
        }
    }
    

    Update in response to a comment:

    May I know what's the reason that manual reading is faster than the other one?

    I don't know how the BinaryReader is actually implemented, so these are just assumptions.

    The actual read from the disk is not the expensive part. The expensive part is moving the read head into the correct position on the disk.

    As your application isn't the only one reading from the hard drive, the disk has to re-position itself every time an application requests a read.

    Thus, if the BinaryReader just reads the requested int, it has to wait on the disk for every read (if some other application makes a read in between).

    As I read a much larger buffer directly (which is faster) I can process more integers without having to wait for the disk between reads.

    Caching will of course speed things up a bit, and that's why it's "just" three times faster.

    (future readers: If something above is incorrect, please correct me).
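
    As a follow-up sketch of my own (not part of the original question or answer): the same chunked reading can be wrapped in a lazy enumerator so the values are decoded block by block without materializing the whole payload in a List<long> at once. The method name and the chunk size of 8192 values are arbitrary choices:

    private static IEnumerable<long> StreamNumbers(string fileName)
    {
        using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192,
            FileOptions.SequentialScan))
        using (var br = new BinaryReader(fs))
        {
            long remaining = br.ReadInt32();   // the count was written as a 32-bit int
            const int chunkSize = 8192;        // values decoded per chunk
            while (remaining > 0)
            {
                int n = (int)Math.Min(chunkSize, remaining);
                byte[] bytes = br.ReadBytes(n * sizeof(long));
                if (bytes.Length != n * sizeof(long))
                    yield break;               // file ended early
                for (int i = 0; i < n; i++)
                    yield return BitConverter.ToInt64(bytes, i * sizeof(long));
                remaining -= n;
            }
        }
    }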