Tags: windows-server-2008-r2, .net-4.5, ntfs, c#-5.0

How to improve throughput with FileStream in a single-threaded application


I am trying to get top I/O performance in a data streaming application with eight SSDs in RAID-5 (each SSD advertises and delivers 500 MB/sec reads).

I create a FileStream with a 64 KB buffer and read many blocks in a blocking fashion (pun not intended). Here's what I have now with 80 GB in 20K files, no fragmentation: legacy blocking reads run at 1270 MB/sec with a single thread and 1556 MB/sec with 6 threads.

What I noticed with a single thread is that a single core's worth of CPU time is spent in kernel mode (8.3% red in Process Explorer with 12 cores). With 6 threads, approximately 5x as much CPU time is spent in kernel mode (41% red in Process Explorer with 12 cores).

I would really like to avoid the complexity of a multi-threaded application in this I/O-bound scenario.

Is it possible to achieve these transfer rates in a single-threaded application? That is, what would be a good way to reduce the amount of time spent in kernel mode?

How, if at all, would the new async/await feature in C# 5 help?
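
For concreteness, here is the kind of single-threaded async loop I have in mind (a rough sketch only; the helper name, path, and 64 KB sizes are placeholders, not code I'm running today):

   // Rough sketch: ReadAllAsync and the sizes below are placeholders.
   async Task<long> ReadAllAsync(string path)
   {
      long total = 0;
      var buffer = new byte[64 * 1024];
      // FileOptions.Asynchronous opens the handle for overlapped I/O, so
      // ReadAsync issues a true async read instead of blocking a pool thread.
      using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
          FileShare.None, 64 * 1024, FileOptions.Asynchronous))
      {
         int n;
         while ((n = await fs.ReadAsync(buffer, 0, buffer.Length)) != 0)
            total += n;   // still logically single-threaded between awaits
      }
      return total;
   }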

For comparison, the ATTO disk benchmark shows 2500 MB/sec at these block sizes on this hardware, with low CPU utilization. However, the ATTO dataset size is a mere 2 GB.

Using an LSI 9265-8i RAID controller with a 64 KB stripe size and 64 KB cluster size.

[Images: Process Explorer summary, I/O counts, ATTO results at queue depth 10, and an ATTO single-request run]

Here's a sketch of the code in use. I don't write production code this way; it's just a proof of concept.

   volatile bool _somethingLeftToRead = false;
   volatile bool _threadCanceled = false;   // set elsewhere to cancel the run
   long _totalReadInSize = 0;
   void ProcessReadThread(object obj)
   {
      TestThreadJob job = obj as TestThreadJob;
      var dirInfo = new DirectoryInfo(job.InFilePath);
      int chunk = job.DataBatchSize * 1024;

      //var tile = new List<byte[]>();

      var sw = new Stopwatch();

      var allFiles = dirInfo.GetFiles();

      var fileStreams = new List<FileStream>();
      long totalSize = 0;
      _totalReadInSize = 0;

      foreach (var fileInfo in allFiles)
      {
         totalSize += fileInfo.Length;
         var fileStream = new FileStream(fileInfo.FullName,
             FileMode.Open, FileAccess.Read, FileShare.None, job.FileBufferSize * 1024);

         fileStreams.Add(fileStream);
      }

      // One reusable buffer for the single-threaded path; parallel tasks get
      // their own TaskParam and buffer below so they don't share state.
      var partial = new byte[chunk];
      var tasks = new List<Task>();
      // Number of passes needed to cover every file, NumThreads files per pass.
      int numTasks = (int)Math.Ceiling(fileStreams.Count * 1.0 / job.NumThreads);
      sw.Start();

      do
      {
         _somethingLeftToRead = false;

         for (int taskIndex = 0; taskIndex < numTasks; taskIndex++)
         {
            if (_threadCanceled)
               break;
            tasks.Clear();
            for (int thread = 0; thread < job.NumThreads; thread++)
            {
               if (_threadCanceled)
                  break;
               int fileIndex = taskIndex * job.NumThreads + thread;
               if (fileIndex >= fileStreams.Count)
                  break;
               var fileStream = fileStreams[fileIndex];

               if (job.NumThreads == 1)
                  ProcessFileRead(new TaskParam(fileStream, partial));
               else
                  // A fresh TaskParam (and buffer) per task; reusing a single
                  // instance would overwrite File before a task runs and make
                  // every task read into the same array.
                  tasks.Add(Task.Factory.StartNew(ProcessFileRead,
                     new TaskParam(fileStream, new byte[chunk])));

               //tile.Add(partial);
            }
            if (_threadCanceled)
               break;
            if (job.NumThreads > 1)
               Task.WaitAll(tasks.ToArray());
         }

         //tile = new List<byte[]>();
      }
      while (_somethingLeftToRead);

      sw.Stop();

      foreach (var fileStream in fileStreams)
         fileStream.Close();

      totalSize = (long)Math.Round(totalSize / 1024.0 / 1024.0);
      UpdateUIRead(false, totalSize, sw.Elapsed.TotalSeconds);
   }

   void ProcessFileRead(object taskParam)
   {
      TaskParam param = taskParam as TaskParam;
      int readInSize;
      if ((readInSize = param.File.Read(param.Bytes, 0, param.Bytes.Length)) != 0)
      {
         _somethingLeftToRead = true;
         // Interlocked (System.Threading) keeps the total correct when several
         // tasks finish reads at once; a plain += here is a data race.
         Interlocked.Add(ref _totalReadInSize, readInSize);
      }
   }

Solution

  • There are a number of issues here.

    First, I see that you are not trying to use non-cached I/O. This means that the system will try to cache your data in RAM and service reads out of the cache, so you pay for an extra copy of every byte. Do non-cached I/O, as in the sketch below.
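
    FileStream has no named option for this, but a common (officially unsupported) trick is to pass the Win32 FILE_FLAG_NO_BUFFERING value through the FileOptions parameter. A minimal sketch, assuming 64 KB reads (a multiple of any common sector size, which the flag requires) and a hypothetical ReadUncached helper:

      // FILE_FLAG_NO_BUFFERING has no named FileOptions member; this cast is
      // the usual (unsupported) way to reach it from managed code.
      const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

      long ReadUncached(string path)
      {
         long total = 0;
         // With the cache off, reads must be whole sectors at sector-aligned
         // offsets; 64 KB requests keep both constraints satisfied.
         var buffer = new byte[64 * 1024];
         using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
             FileShare.None, buffer.Length, FileFlagNoBuffering))
         {
            int n;
            while ((n = fs.Read(buffer, 0, buffer.Length)) != 0)
               total += n;
         }
         return total;
      }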

    Next, you appear to be creating/destroying threads inside a loop. This is inefficient; start a fixed set of long-lived workers once and feed them work instead, as sketched below.
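
    A minimal sketch of that pattern, assuming the fileStreams list from the question and a worker count of 6 (both placeholders), using BlockingCollection from System.Collections.Concurrent:

      // Long-lived readers draining a shared queue; the tasks are started
      // once, not once per loop iteration.
      var queue = new BlockingCollection<FileStream>();
      var workers = new Task[6];
      for (int i = 0; i < workers.Length; i++)
      {
         workers[i] = Task.Factory.StartNew(() =>
         {
            var buffer = new byte[64 * 1024];
            foreach (var fs in queue.GetConsumingEnumerable())
               while (fs.Read(buffer, 0, buffer.Length) != 0) { /* consume */ }
         }, TaskCreationOptions.LongRunning);
      }

      foreach (var fileStream in fileStreams)
         queue.Add(fileStream);
      queue.CompleteAdding();        // lets the workers drain and exit
      Task.WaitAll(workers);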

    Lastly, you need to investigate the alignment of the data. Crossing read-block boundaries can add to your costs; with a 64 KB stripe and cluster size, keeping request sizes and offsets on 64 KB boundaries avoids that (see the snippet below).
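
    A one-liner for that, assuming the 64 KB (0x10000) stripe/cluster size from the question:

      // Round a request size up to the next 64 KB boundary so a read never
      // straddles a stripe/cluster boundary unnecessarily.
      static int Align64K(int size)
      {
         return (size + 0xFFFF) & ~0xFFFF;
      }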

    I would advocate using non-cached, async I/O. I'm not sure how to accomplish this in C# (but it should be easy); one way is sketched below.
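
    In managed code the two ideas combine in a single constructor call (again a sketch, reusing the unsupported no-buffering cast from above):

      // Overlapped + non-cached handle: ReadAsync now issues true async,
      // uncached reads while the application stays single-threaded.
      var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None,
          64 * 1024, FileOptions.Asynchronous | (FileOptions)0x20000000);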

    EDITED: Also, why are you using RAID-5? Unless the data is write-once, this is likely to have hideous performance on SSDs. Notably, the erase block size is typically 512 KB, meaning that when you write something smaller, the SSD must read the 512 KB block in its firmware, change the data, and then write it somewhere else. You might want to make the stripe size equal to the erase block size. You should also check the alignment of the writes.