Tags: asp.net, sql, sql-server-2008, compression, sharpziplib

Most Space-Efficient Way to Store a Byte Array in a Database Table - ASP.NET


Right now we have a database table (SQL Server 2008 R2) that stores an uploaded file (PDF, DOC, TXT, etc.) in an image-type column. A user uploads this file from an ASP.NET application. My project is to get a handle on how quickly this table is growing, and I've come up with a couple of questions along the way.

  1. On the database side, I've discovered the image column type is supposedly deprecated. Will I gain any benefits from switching over to varbinary(max), or should I say varbinary(5767168) because that is my file size cap, or might I as well just leave it as an image type as far as space-efficiency is concerned?

  2. On the application side, I want to compress the byte array. Microsoft's built-in GZipStream sometimes made the file bigger instead of smaller. I switched over to SharpZipLib, which is much better, but I still occasionally run into the same problem. Is there a way to find out the average compression savings before I implement it on a wide scale (a rough harness is sketched after this list)? I'm having a hard time finding out what underlying algorithm they use.

  3. Would it be worth writing a Huffman coding algorithm of my own, or will that present the same problem, where the compressed file is occasionally larger than the original?
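
For question 2, this is roughly the kind of harness I have in mind: run it over a sample of the existing files (pulled into a collection of byte arrays beforehand; sampleFiles here is just a placeholder) and it reports the average savings and how many files actually grew. It calls the same Compress method shown further down.

    using System;
    using System.Collections.Generic;

    private static void MeasureCompressionSavings(IEnumerable<byte[]> sampleFiles)
    {
        long originalTotal = 0;
        long compressedTotal = 0;
        int grewCount = 0;

        foreach (byte[] file in sampleFiles)
        {
            byte[] compressed = Compress(file);   // Compress is defined below
            originalTotal += file.Length;
            compressedTotal += compressed.Length;

            // Count the cases where gzip overhead made the file larger.
            if (compressed.Length >= file.Length)
                grewCount++;
        }

        double savings = 1.0 - (double)compressedTotal / originalTotal;
        Console.WriteLine("Average savings: {0:P1}, files that grew: {1}", savings, grewCount);
    }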

For reference, in case it matters, here's the code in my app:

    using ICSharpCode.SharpZipLib.GZip;
    using System.IO;

    // Compresses a byte array using SharpZipLib's GZip stream.
    private static byte[] Compress(byte[] data)
    {
        MemoryStream output = new MemoryStream();

        using (GZipOutputStream gzip = new GZipOutputStream(output))
        {
            // Leave the underlying MemoryStream open so it can be read after the gzip stream is closed.
            gzip.IsStreamOwner = false;
            gzip.Write(data, 0, data.Length);
            gzip.Close();
        }
        return output.ToArray();
    }

    // Decompresses a GZip-compressed byte array back into the original bytes.
    private static byte[] Decompress(byte[] data)
    {
        MemoryStream output = new MemoryStream();
        MemoryStream input = new MemoryStream(data);

        using (GZipInputStream gzip = new GZipInputStream(input))
        {
            // Read the decompressed data in chunks until the stream is exhausted.
            byte[] buff = new byte[64];
            int read = gzip.Read(buff, 0, buff.Length);

            while (read > 0)
            {
                output.Write(buff, 0, read);
                read = gzip.Read(buff, 0, buff.Length);
            }

            gzip.Close();
        }
        return output.ToArray();
    }

Thanks in advance for any help. :)


Solution

  • I hate to be a jerk and answer my own question, but I thought I'd summarize my findings into a complete answer for anyone else looking to space-efficiently store file/image data within a database:

    * Using varbinary(MAX) versus Image?

    Many reasons to use varbinary(MAX), but top among them is that image is deprecated and will be removed altogether in a future version of SQL Server. Not starting any new projects with it just nips a future problem in the bud.

    According to the information in this question, SQL Server table structure for storing a large number of images, varbinary(MAX) has more operations available on it than image does.

    Varbinary(MAX) is easy to stream from a .NET application by using a SqlParameter. A length of -1 stands for 'MAX'. Like so:

    SQLCommand1.Parameters.Add("@binaryValue", SqlDbType.VarBinary, -1).Value = compressedBytes;
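
    For a fuller picture, here is a minimal sketch of how that parameter might be used end to end. The table name (UploadedFiles), the column names, and the SaveFile method are illustrative placeholders, not part of the actual schema:

        using System.Data;
        using System.Data.SqlClient;

        // Inserts the (optionally compressed) file bytes into a varbinary(MAX) column.
        // Table and column names are placeholders for illustration only.
        private static void SaveFile(string connectionString, string fileName, byte[] compressedBytes)
        {
            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO UploadedFiles (FileName, FileData) VALUES (@fileName, @binaryValue)", conn))
            {
                cmd.Parameters.Add("@fileName", SqlDbType.NVarChar, 260).Value = fileName;

                // -1 maps to varbinary(MAX); ADO.NET handles sending the byte array to the server.
                cmd.Parameters.Add("@binaryValue", SqlDbType.VarBinary, -1).Value = compressedBytes;

                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }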
    

    * What compression algorithm to use?

    I'm really not much closer to a decent answer on this one. I used ICSharpCode.SharpZipLib.GZip and found that it compressed better than the built-in zipping functions, simply by running it over a batch of files and comparing the results.

    My results:

    I reduced my total file size by about 20%. Unfortunately, a lot of the files I had were PDFs, which don't compress that well, but there was still some benefit. Not much luck (obviously) with file types that were already compressed.
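
    One way to deal with compression occasionally producing a larger file (which any general-purpose coder, Huffman included, can do on already-compressed input) is to keep whichever version is smaller and record which one was stored. This is only a sketch of that idea, not something from my actual application; the CompressIfSmaller name and the idea of an IsCompressed flag column are assumptions:

        // Returns the gzip-compressed bytes only when they are actually smaller;
        // otherwise returns the original bytes. The out parameter reports which
        // version was kept, so it can be stored alongside the row
        // (e.g. in an IsCompressed bit column) and checked before decompressing.
        private static byte[] CompressIfSmaller(byte[] data, out bool compressed)
        {
            byte[] candidate = Compress(data);   // same Compress method as in the question

            if (candidate.Length < data.Length)
            {
                compressed = true;
                return candidate;
            }

            compressed = false;
            return data;
        }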