Tags: c#, .net, caching, file-read

C# reading text files sequentially from an SSD drive is very slow the first time


I am experiencing very slow sequential reads of text files from an SSD, but only the first time during a Windows session. The second and subsequent times, reading is more than 60x faster (in the second and subsequent runs of the application, which means different Windows processes). I am pretty sure this is a normal consequence of system and/or SSD caching optimization, but I am asking here to double-check this guess with the community, because I didn't find anything by googling.

Question: Can something be done to make the first-time read faster?


Answer:

I am updating this question because, thanks to @Alois Kraus's answer, we found the culprit: the Windows Defender file check performed upon file read!

This slow Windows Defender check can be disabled through Start > Settings > Update & Security > Windows Security > Virus & threat protection. Under Virus & threat protection settings, select Manage settings, and then under Exclusions, select Add an exclusion. I can solve this particular issue either through a Folder exclusion (it works recursively under the folder) or a File type exclusion, listing the file extension(s) for which to disable the slow check.

Now the remaining questions are:

    1. how to programmatically detect that Windows Defender is slowing our code down, so we can emit a warning to the user and explain what to do (a hedged check is sketched below)
    2. how to disable the Windows Defender slow check programmatically (which I am pretty sure is impossible, otherwise this would be an open door for viruses!)

Windows Defender Exclusion
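
For point 1, a possible partial check (an approximation, not a definitive detection): if Defender real-time protection is enabled and the data folder is not in the exclusion list, first-time reads will most likely go through the scanner, so the application can warn the user. The sketch below shells out to the Defender PowerShell cmdlets Get-MpComputerStatus and Get-MpPreference; note that reading the exclusion list may itself require an elevated process, and exclusions of parent folders are not handled.

using System;
using System.Diagnostics;
using System.Linq;

static string RunPowerShell(string command)
{
    var psi = new ProcessStartInfo("powershell.exe", "-NoProfile -Command \"" + command + "\"")
    {
        RedirectStandardOutput = true,
        UseShellExecute = false,
        CreateNoWindow = true
    };
    using var process = Process.Start(psi);
    string output = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
    return output;
}

// Heuristic only: true when real-time protection is on and dataFolder is not excluded.
static bool DefenderLikelyScans(string dataFolder)
{
    bool realTimeOn = RunPowerShell("(Get-MpComputerStatus).RealTimeProtectionEnabled")
        .Trim().Equals("True", StringComparison.OrdinalIgnoreCase);

    bool folderExcluded = RunPowerShell("(Get-MpPreference).ExclusionPath")
        .Split('\n')
        .Select(line => line.Trim())
        .Any(path => path.Equals(dataFolder, StringComparison.OrdinalIgnoreCase));

    return realTimeOn && !folderExcluded;
}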


Details:

I reproduce the issue on 2 distinct machines, one with a SAMSUNG SSD MZVLB1T0HALR 1TB (max read speed 3200MB/sec) and one with a SAMSUNG SSD MZVL21T0HCLR 1TB (max read speed 7000MB/sec).

With a classic File.ReadAllText(filePathStr) call I get 33.555sec the first time for 4,737 files (screenshot from dotTrace). All 4.7K text files together weigh 28MB on disk, so this is less than 1MB read/sec.

Then on the second and subsequent times it is more than 60x faster: 540ms instead of 33sec, around 60MB read/sec (still very far from the announced SSD max read speed of 3200MB/sec, but we are reading 4.7K files instead of just one).


I know File.ReadAllText() has some limitations (like assuming the text file is UTF-8 encoded, which is fine in our context), so I tried something like:

FileInfo fileInfo = new FileInfo(filePathStr);
long length = fileInfo.Length;
// StreamReader needs a buffer size >= 1, so guard against empty files
// and cap at int.MaxValue for huge ones
int bufferSize = length >= int.MaxValue ? int.MaxValue : Math.Max(1, (int)length);
using (FileStream fs = File.Open(filePathStr, FileMode.Open, FileAccess.Read))
using (StreamReader streamReader = new StreamReader(
          fs, Encoding.UTF8, detectEncodingFromByteOrderMarks: true,
          bufferSize: bufferSize, leaveOpen: false)) {
   // content and failureReason are out parameters of the surrounding method
   content = streamReader.ReadToEnd();
   failureReason = null;
   return true;
}

...with the same super slow performance the first time, and slower performance on the second and subsequent runs (853ms instead of 540ms):

(dotTrace screenshot)


Finally I did a test with this code:

byte[] bytes = File.ReadAllBytes(filePathStr);
string content = File.ReadAllText(filePathStr);

On the first run I measured this, which indicates that the slowdown has nothing to do with text decoding but only with reading a file for the first time:

(dotTrace screenshot)
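
A minimal way to reproduce this comparison without a profiler is to time both calls with a Stopwatch (a sketch; paths is assumed to hold the ~4.7K file paths, and per the measurement above essentially all of the first-run time should accumulate in the ReadAllBytes stopwatch):

// Illustrative fragment; needs System, System.Diagnostics and System.IO.
var swBytes = new Stopwatch();
var swText = new Stopwatch();
foreach (string path in paths)
{
    swBytes.Start();
    byte[] bytes = File.ReadAllBytes(path);   // first touch of the file
    swBytes.Stop();

    swText.Start();
    string content = File.ReadAllText(path);  // second touch of the same file
    swText.Stop();
}
Console.WriteLine($"ReadAllBytes: {swBytes.ElapsedMilliseconds} ms, ReadAllText: {swText.ElapsedMilliseconds} ms");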


I did more tests reading files with Roslyn's SourceReferenceResolver.ReadText(string resolvedPath):SourceText and got the exact same result: 60x to 80x slower on 4,700 source files, depending on whether the read is performed for the first time after Windows boot or on a second or subsequent run (from 23sec down to 270ms on the beefy new machine!). I guess that if there were a way to achieve better performance, the Roslyn developers would have implemented it, since they are both experts and very concerned with performance.


Some suggested going async, so I tried something like:

var tasks = new List<Task<string>>();
foreach (string path in paths) {
    var task = File.ReadAllTextAsync(path);
    tasks.Add(task);
}
foreach (var task in tasks) {
    string str = await task;
}

but the results are the same as with File.ReadAllText(path): same duration the first time in the Windows session, same duration on the second and subsequent times in the Windows session.
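
For completeness, the batch can also be awaited in one go with Task.WhenAll (a sketch using the same paths list; per the measurements above this changes nothing about the first-run cost, since the time is spent in the scan, not in the awaiting pattern):

// Same reads, awaited as a single batch; needs System.Linq and System.Threading.Tasks.
string[] contents = await Task.WhenAll(paths.Select(p => File.ReadAllTextAsync(p)));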


Solution

  • You need to profile. But given that big a slowdown, it is most likely AV scanner overhead. AV scanners tend to intercept file IO and scan the file contents, which takes time. To prove it, you need to start in an elevated command shell

    wpr -start CPU -start DiskIO -start FileIO
    Run your test application or execute the use case from your already running application
    wpr -stop c:\temp\SlowFileRead.etl
    

    You can repeat that for the fast case and name the file accordingly, e.g. FastSecondRead.etl

    Wpr is part of Windows and works reliably since Windows 11. For older OS versions you need to install the Windows Performance Toolkit from the Windows SDK (https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/ ; the Win 11 WPT also works on Windows 10).

    Then you can download ETWAnalyzer (which I wrote) from https://github.com/Siemens-Healthineers/ETWAnalyzer/releases and do

    ETWAnalyzer -extract all -fd c:\temp\SlowFileRead.etl -symserver ms
    cd c:\temp\Extract
    ETWAnalyzer -dump CPU -pn YourAppName.exe -stacktags virus
    

    Here is an example of how the output of the virus stacktag query looks: (ETWAnalyzer screenshot)

    For more queries, see the ETWAnalyzer help and this blog post:

    I have a small test application which creates 5K files and writes to them, which is also (sometimes) very slow, and which has ETW profiling enabled to show what is influencing the performance:

    Again, this is definitely not SSD or file system cache related. Reading 28 MB in 33 s amounts to 0.8 MB/s, which is several orders of magnitude away from any SSD performance limit.

    When ETWAnalyzer shows high Defender wait times for your test process, then you have found the issue. If not, you need to dump the methods and sort by wait time with

    ETWAnalyzer -dump CPU -pn YourAppName.exe -methods *  -SortBy Wait -topNMethods 150
    ETWAnalyzer -dump CPU -pn YourAppName.exe -methods *.sys;*.dll  -SortBy Wait -smi
    

    The second query displays all device drivers and DLLs which have no symbols (AV scanners generally do not provide symbols), which is a good way to quickly identify new AV drivers from other vendors.

    You can also export the data to CSV by adding -csv waitTimes.csv and continue your analysis in Excel if you want to. I would check the ReadFile/CreateFile/CloseHandle wait times and look for methods with similar wait times, which are most likely part of the call stack. Then you will find the offending intercepting drivers or other things. If you are checking for a specific driver version, you can add -smi, which is the shortcut for -ShowModuleVersion, to quickly show module versions on each line.

    Antivirus usually spends most of its time in CreateFile, so I would start looking there first to see whether a file system filter driver with similar timings is intercepting your calls.

    To answer the question what you can do:

    You can add an exclusion rule if your application is running elevated. For Defender, for example, you can add an exclusion rule via the PowerShell interface:

    Add-MpPreference -ExclusionPath "C:\temp\Test"
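
    If your application can obtain elevation, a minimal sketch of issuing that same command from C# (assuming that launching PowerShell from the application is acceptable; the path is the example path from above):

    // Needs System.Diagnostics. "runas" triggers the UAC prompt if the current
    // process is not already elevated; UseShellExecute must be true for Verb to apply.
    var psi = new ProcessStartInfo
    {
        FileName = "powershell.exe",
        Arguments = "-NoProfile -Command \"Add-MpPreference -ExclusionPath 'C:\\temp\\Test'\"",
        Verb = "runas",
        UseShellExecute = true
    };
    Process.Start(psi);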
    

    You can also ask the user to add an AV exclusion for the data files you are going to read, to speed things up. This is a good thing anyway, because build times go down the drain if AV scanners are running on build directories where new executables frequently pop up and go away, which attracts AV scanners like a dead horse in the desert attracts vultures.

    Update: what you can do

    One possibility is to do a self speed test with a known data set. You can e.g. create a folder in your target directory, fill it with 5000 files, and measure how long that takes. This will not test disk speed, because the writes are flushed asynchronously by the OS as long as the modified bytes stay below a specific threshold. If you are much slower than you anticipated, you can issue a warning (a sketch of such a self test follows below). The next step is to check the installed drivers in C:\Windows\System32\drivers. ETWAnalyzer knows practically all AV drivers in its list https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Configuration/WellKnownDrivers.json, which you can also use to point the user towards creating an exclusion for his AV solution. The problem is that the same drivers are used in different products, and companies tend to aggregate and merge, but the drivers remain.
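
    A minimal sketch of such a self speed test, assuming a 4 KB payload per file and an arbitrary 5-second warning threshold that would need to be calibrated for the target environment:

    using System;
    using System.Diagnostics;
    using System.IO;

    static void SelfSpeedTest(string targetDirectory, int fileCount = 5000)
    {
        // Scratch folder inside the directory the application will actually read from,
        // so the same AV exclusions (or lack thereof) apply.
        string scratch = Path.Combine(targetDirectory, "speedtest_" + Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(scratch);
        byte[] payload = new byte[4096]; // small files stay below the OS write-back threshold

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < fileCount; i++)
        {
            File.WriteAllBytes(Path.Combine(scratch, $"file_{i}.bin"), payload);
        }
        sw.Stop();

        Directory.Delete(scratch, recursive: true);

        // Assumed threshold: creating 5000 small buffered files should normally take well
        // under a few seconds; if it takes much longer, an AV scanner intercepting file IO
        // is the most likely cause.
        if (sw.Elapsed > TimeSpan.FromSeconds(5))
        {
            Console.WriteLine(
                $"Writing {fileCount} small files took {sw.Elapsed.TotalSeconds:F1}s. " +
                "An antivirus scanner is probably intercepting file IO; consider adding an exclusion.");
        }
    }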