Tags: c#, excel, openxml

Need to efficiently scan for low-end ASCII control characters in large Excel files


I am working on an ETL validation procedure that scans for invalid ASCII characters prior to processing. In this case, "invalid" is defined as ASCII characters in the 0-31 range.

In a C# ETL validation service I am using OfficeOpenXml to examine the contents of the Excel files.

Does anyone know of a more efficient way to search the contents besides looping over each worksheet, each column and each row? The files can be very large and the validation should be as fast as possible.
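For context, the per-cell loop described above looks roughly like this with OfficeOpenXml (EPPlus); the method and helper names are mine, not part of an official API:

```csharp
using System.IO;
using System.Linq;
using OfficeOpenXml; // EPPlus NuGet package

static class WorkbookScanner
{
    // True if the string contains any ASCII control character (0-31).
    public static bool ContainsControl(string s) => s.Any(c => c < 32);

    // Returns the address of the first offending cell, or null if clean.
    public static string FindInvalidCell(string path)
    {
        using (var package = new ExcelPackage(new FileInfo(path)))
        {
            foreach (var sheet in package.Workbook.Worksheets)
            {
                // Enumerating the Cells range visits cells that hold a
                // value, rather than every coordinate in the dimension.
                foreach (var cell in sheet.Cells)
                {
                    if (cell.Value is string text && ContainsControl(text))
                        return $"{sheet.Name}!{cell.Address}";
                }
            }
        }
        return null;
    }
}
```

Note that tab (9), LF (10) and CR (13) fall inside the 0-31 range; exclude them in `ContainsControl` if they are legal in your feed.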

Is there access to the raw XML buffer? Would it be faster to unzip the XML files and scan the content there?


Solution

  • First, I think it is time for the Speed Rant: https://ericlippert.com/2012/12/17/performance-rant/

    The question is where the bottleneck is right now. My instinct tells me it should be the disk. You are working with files, so it usually is the disk. If that is the case, short of loading each cell into memory only once, there is not much to speed up.

    You are, however, doing fairly deep processing of strings, as you will have to go through every char. So there might be a relevant amount of time spent on that. Most likely not the bottleneck, but a cost you can negate.

    You might be able to do some asynchronous processing, having the next cell/row load in the background while you process the current one. An approach like Directory.EnumerateFiles() vs. Directory.GetFiles() might work: https://learn.microsoft.com/en-us/dotnet/api/system.io.directory.enumeratefiles

    Row does seem to have a GetEnumerator function. But it might just be there to provide an Enumerator for code that requires Enumerators, and not actually do any deferred/background loading (i.e. like the Enumerator implicitly created for foreach loops).
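On the raw-XML idea from the question: the .xlsx container can be opened with System.IO.Compression and the entry streams scanned byte-by-byte without materializing cell objects. Two caveats: tab, LF and CR are legal XML whitespace yet sit inside the 0-31 range, and spreadsheet producers typically store other control characters as `_xHHHH_` escapes rather than raw bytes, so a raw scan may need to look for those escapes as well. A sketch (the class name and entry filter are my assumptions, not from the post):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class RawXmlScanner
{
    // Scans the XML entries inside an .xlsx for raw control bytes (0-31)
    // other than tab/LF/CR, which are legal XML whitespace. Control chars
    // written as _xHHHH_ escapes are NOT caught by this byte-level check.
    public static bool HasRawControlBytes(string xlsxPath)
    {
        using (var zip = ZipFile.OpenRead(xlsxPath))
        {
            foreach (var entry in zip.Entries)
            {
                if (!entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                    continue;

                // Buffer the deflate stream so ReadByte stays cheap.
                using (var stream = new BufferedStream(entry.Open(), 64 * 1024))
                {
                    int b;
                    while ((b = stream.ReadByte()) != -1)
                    {
                        if (b < 32 && b != 9 && b != 10 && b != 13)
                            return true;
                    }
                }
            }
        }
        return false;
    }
}
```

This keeps memory flat regardless of workbook size, since each entry is streamed straight out of the zip; whether it beats the cell-by-cell loop still comes down to the disk, per the rant above.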