Slow unzipping of text files using C# and DotNetZip (.NET 4.0)


I am writing a method to extract information from zipped files. Every zip file contains exactly one text file. The method is intended to return a string array.

I am using DotNetZip, but I am experiencing horrible performance. I have tried benchmarking each step, and it seems to be slow at every step.

The C# code is:

    public string[] LoadZipFile(string FileName)
    {
        string[] lines = { };
        int start = System.Environment.TickCount;
        this.richTextBoxLOG.AppendText("Reading " + FileName + "... ");
        try
        {
            int nstart;

            nstart = System.Environment.TickCount;       
            ZipFile zip = ZipFile.Read(FileName);
            this.richTextBoxLOG.AppendText(String.Format("ZipFile ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            MemoryStream ms = new MemoryStream();
            this.richTextBoxLOG.AppendText(String.Format("Memorystream ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            zip[0].Extract(ms);
            this.richTextBoxLOG.AppendText(String.Format("Extract ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            string filecontents = string.Empty;
            using (var reader = new StreamReader(ms)) 
            { 
                reader.BaseStream.Seek(0, SeekOrigin.Begin); 
                filecontents = reader.ReadToEnd().ToString(); 
            }
            this.richTextBoxLOG.AppendText(String.Format("Read ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            lines = filecontents.Replace("\r\n", "\n").Split("\n".ToCharArray());
            this.richTextBoxLOG.AppendText(String.Format("SplitLines ({0}ms)\n", System.Environment.TickCount - nstart));
        }
        catch (IOException ex)
        {
            this.richTextBoxLOG.AppendText(ex.Message+ "\n"); 

        }
        int slut = System.Environment.TickCount;
        this.richTextBoxLOG.AppendText(String.Format("Done ({0}ms)\n", slut - start)); 
        return (lines);
    }

As an example I get this output:

    Reading xxxx.zip... ZipFile (0ms)
    Memorystream (0ms)
    Extract (234ms)
    Read (78ms)
    SplitLines (187ms)
    Done (514ms)

A total of 514 ms. When the same operation is performed in Python 2.6 using this code:

    import zipfile

    def ReadZip(File):
        # Open the archive and read its first (and only) entry
        z = zipfile.ZipFile(File, "r")
        name = z.namelist()[0]
        return z.read(name).split('\r\n')

It executes in just 89 ms. Any ideas on how to improve performance are very welcome.


Solution

  • Thanks for the suggestions. I ended up changing the code in a few ways:

    • Using a generic collection (List<string>) to return the lines
    • Using StreamReader.ReadLine instead of ReadToEnd followed by Split

    Removing logging and exception handling did not change performance much. I looked at the SharpZipLib unzip library, but it looked a little more complicated to implement, and from what I could read in other posts there was perhaps only a small gain in unzipping speed. It is now running at around 300 ms.

        public List<string> LoadZipFile2(string FileName)
        {
            List<string> lines = new List<string>();
            int start = System.Environment.TickCount;
            string debugtext;
            debugtext = "Reading " + FileName + "... ";
            this.richTextBoxLOG.AppendText(debugtext);
    
            try
            {
                //int nstart = System.Environment.TickCount;
                ZipFile zip = ZipFile.Read(FileName);
               // this.richTextBoxLOG.AppendText(String.Format("ZipFile ({0}ms)\n", System.Environment.TickCount - nstart));
    
                //nstart = System.Environment.TickCount;
                MemoryStream ms = new MemoryStream();
                //this.richTextBoxLOG.AppendText(String.Format("Memorystream ({0}ms)\n", System.Environment.TickCount - nstart));
    
                //nstart = System.Environment.TickCount;
                zip[0].Extract(ms);
                zip.Dispose();
                //this.richTextBoxLOG.AppendText(String.Format("Extract ({0}ms)\n", System.Environment.TickCount - nstart));
    
                //nstart = System.Environment.TickCount;
                using (var reader = new StreamReader(ms))
                {
                    reader.BaseStream.Seek(0, SeekOrigin.Begin);
                    while (reader.Peek() >= 0)
                    {
                        lines.Add(reader.ReadLine());
                    }
                }
                //this.richTextBoxLOG.AppendText(String.Format("Read ({0}ms)\n", System.Environment.TickCount - nstart));
            }
            catch (IOException ex)
            {
                this.richTextBoxLOG.AppendText(ex.Message + "\n");
            }
            int slut = System.Environment.TickCount;
            this.richTextBoxLOG.AppendText(String.Format("Done ({0}ms)\n", slut - start));
            return (lines);
        }
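
A further variation that may be worth trying, shown only as a sketch: instead of extracting the entry into a MemoryStream and then reading it back, the lines can be read directly from the entry's decompression stream via DotNetZip's `ZipEntry.OpenReader()`. The class name `ZipLineReader`, the method name `ReadFirstEntryLines`, and the `Main` driver are hypothetical and not part of the code above. Timing uses `Stopwatch` rather than `Environment.TickCount`, since `TickCount` is typically only updated every 10 to 16 ms. Whether this version is actually faster has not been measured here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using Ionic.Zip; // DotNetZip

public static class ZipLineReader
{
    // Hypothetical helper (not from the original post): reads the first entry
    // of a zip archive line by line, straight from the decompression stream.
    public static List<string> ReadFirstEntryLines(string fileName)
    {
        var lines = new List<string>();

        // ZipFile is IDisposable; "using" releases the file handle even on error.
        using (ZipFile zip = ZipFile.Read(fileName))
        // OpenReader() returns a read-only stream over the decompressed entry data.
        using (Stream entryStream = zip[0].OpenReader())
        using (var reader = new StreamReader(entryStream))
        {
            string line;
            // ReadLine returns null at end of stream and accepts both \r\n and \n.
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }

        return lines;
    }

    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path to a zip file with one text entry.
        Stopwatch watch = Stopwatch.StartNew();
        List<string> lines = ReadFirstEntryLines(args[0]);
        watch.Stop();

        Console.WriteLine("{0} lines in {1} ms", lines.Count, watch.ElapsedMilliseconds);
    }
}
```

The sketch has no UI dependency, so the same call could be dropped into `LoadZipFile2` with the `richTextBoxLOG` logging kept around it.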