.net multithreading zip task-parallel-library parallel.foreach

Process ZipArchive entries in parallel


I need to process ZipArchive entries as strings. Currently I have code like this:

using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
{
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        using (StreamReader sr = new StreamReader(entry.Open()))
        {
            string s = sr.ReadToEnd();
            // doing something with s
        }
    }
}

The processing could be much faster if it were spread across several CPU cores using Parallel.ForEach or a similar loop. The problem is that ZipArchive is not thread-safe.

Perhaps we could use the Partitioner class to split ZipArchive.Entries into index ranges and feed those ranges into a Parallel.ForEach loop. The loop body would then open the zip archive again, so that each range reads its entries through its own ZipArchive instance and no instance is shared between threads. I have no good idea how to do that, though. Is it possible?

If not, is there another reliable way to process zip archive entries in parallel if we just need to read them?


Solution

  • If my assumption is right, the multi-threaded version of my code processing a ZipArchive should look like this:

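    // Namespaces: System.Collections.Concurrent, System.IO,
    // System.IO.Compression, System.Threading.Tasks
    using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
    {
        // Split the entry indices [0, Count) into contiguous ranges.
        // Note: Partitioner.Create throws if the archive has no entries.
        var ranges = Partitioner.Create(0, archive.Entries.Count);

        Parallel.ForEach(ranges, range =>
        {
            // ZipArchive is not thread-safe, so each range uses its own instance.
            using (ZipArchive archive2 = ZipFile.OpenRead(zipFileName))
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    ZipArchiveEntry entry = archive2.Entries[i];
                    using (StreamReader sr = new StreamReader(entry.Open()))
                    {
                        string s = sr.ReadToEnd();
                        // doing something with s
                    }
                }
            }
        });
    }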
    

    P.S. For general information: in my measurements, this multi-threaded version runs 25%-40% slower than the original single-threaded code, so it is an open question whether a zip archive should be processed from multiple threads at all. Measure the multi-threaded code on your own archives to make sure this approach actually improves performance.
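
To act on the advice about measuring, here is a minimal timing sketch using Stopwatch. It is only an illustration under some assumptions: the class name ZipTiming, the Measure helper, and the Process method (a stand-in for "doing something with s" that just returns the string length) are hypothetical and not part of the original code. Both variants return a total so that their results can be compared as well as their timings.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Threading;
using System.Threading.Tasks;

static class ZipTiming
{
    // Hypothetical stand-in for "doing something with s".
    static int Process(string s) => s.Length;

    // The original single-threaded loop; returns the sum of Process results.
    public static long Sequential(string zipFileName)
    {
        long total = 0;
        using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
            foreach (ZipArchiveEntry entry in archive.Entries)
                using (var sr = new StreamReader(entry.Open()))
                    total += Process(sr.ReadToEnd());
        return total;
    }

    // The Partitioner-based version: every range opens its own ZipArchive.
    public static long Parallelized(string zipFileName)
    {
        int count;
        using (ZipArchive archive = ZipFile.OpenRead(zipFileName))
            count = archive.Entries.Count;
        if (count == 0) return 0; // Partitioner.Create rejects an empty range

        long total = 0;
        Parallel.ForEach(Partitioner.Create(0, count), range =>
        {
            long local = 0;
            using (ZipArchive archive2 = ZipFile.OpenRead(zipFileName))
                for (int i = range.Item1; i < range.Item2; i++)
                    using (var sr = new StreamReader(archive2.Entries[i].Open()))
                        local += Process(sr.ReadToEnd());
            // Combine per-range results without a data race.
            Interlocked.Add(ref total, local);
        });
        return total;
    }

    // Runs func once, prints the elapsed wall-clock time, returns func's result.
    public static long Measure(string label, Func<long> func)
    {
        var sw = Stopwatch.StartNew();
        long result = func();
        sw.Stop();
        Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
        return result;
    }
}
```

A call such as ZipTiming.Measure("sequential", () => ZipTiming.Sequential(zipFileName)) followed by the same call for ZipTiming.Parallelized prints both timings. A single run is noisy (JIT compilation and the file system cache both affect it), so it is probably worth repeating each measurement several times and on archives that resemble the real workload.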