Right now we have a database table (SQL Server 2008 R2) that stores an uploaded file (PDF, DOC, TXT, etc.) in an image type column. A user uploads this file from an ASP.NET application. My project is to get a handle on the rate at which this table is growing, and I've come up with a couple of questions along the way.
On the database side, I've discovered the image column type is apparently deprecated. Will I gain any benefit by switching to varbinary(max), or should I say varbinary(5767168) since that is my file size cap, or might I as well just leave it as an image type as far as space efficiency is concerned?
On the application side, I want to compress the byte array. Microsoft's built-in GZip sometimes made the file bigger instead of smaller. I switched over to SharpZipLib, which is much better, but I still occasionally run into the same problem. Is there a way to find out the average compression savings before I implement it on a wide scale? I'm having a hard time finding out what underlying algorithm these libraries use.
Would it be worth writing a Huffman coding algorithm of my own, or will that present the same problem where the compressed file is occasionally larger than the original?
For reference, in case it matters, here's the code in my app:
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

private static byte[] Compress(byte[] data)
{
    MemoryStream output = new MemoryStream();
    using (GZipOutputStream gzip = new GZipOutputStream(output))
    {
        // Leave the underlying MemoryStream open so it can be read after the GZip stream is closed.
        gzip.IsStreamOwner = false;
        gzip.Write(data, 0, data.Length);
        gzip.Close();
    }
    return output.ToArray();
}

private static byte[] Decompress(byte[] data)
{
    MemoryStream output = new MemoryStream();
    MemoryStream input = new MemoryStream(data);
    using (GZipInputStream gzip = new GZipInputStream(input))
    {
        // Copy the decompressed data to the output stream in chunks.
        byte[] buff = new byte[64];
        int read = gzip.Read(buff, 0, buff.Length);
        while (read > 0)
        {
            output.Write(buff, 0, read);
            read = gzip.Read(buff, 0, buff.Length);
        }
        gzip.Close();
    }
    return output.ToArray();
}
Thanks in advance for any help. :)
I hate to be a jerk and answer my own question, but I thought I'd summarize my findings into a complete answer for anyone else looking to space-efficiently store file/image data within a database:
* Using varbinary(MAX) versus Image?
There are many reasons for using varbinary(MAX), but top among them is that image is deprecated and will be removed altogether in a future version of SQL Server. Not starting any new projects with it just nips a future problem in the bud.
According to the info in this question (SQL Server table structure for storing a large number of images), varbinary(MAX) also has more operations available to be used on it.
varbinary(MAX) is easy to stream to from a .NET application by using a SqlParameter; a length of -1 means 'MAX'. Like so:
SQLCommand1.Parameters.Add("@binaryValue", SqlDbType.VarBinary, -1).Value = compressedBytes;
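For context, here is roughly how that parameter fits into a full update. The table name, column names, and connection string are placeholders for illustration, not my actual schema:

using System.Data;
using System.Data.SqlClient;

// Sketch of writing the compressed bytes into a varbinary(MAX) column.
// "FileStore", "FileData", and "FileId" are placeholder names.
private static void SaveCompressedFile(string connectionString, int fileId, byte[] compressedBytes)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    using (SqlCommand cmd = new SqlCommand(
        "UPDATE FileStore SET FileData = @binaryValue WHERE FileId = @fileId", conn))
    {
        // A length of -1 tells ADO.NET the parameter is varbinary(MAX).
        cmd.Parameters.Add("@binaryValue", SqlDbType.VarBinary, -1).Value = compressedBytes;
        cmd.Parameters.Add("@fileId", SqlDbType.Int).Value = fileId;

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}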
* What compression algorithm to use?
I'm really not much closer to a decent answer on this one. I used ICSharpCode.SharpZipLib.GZip and found it did a better job than the built-in GZipStream, simply by running both over a batch of sample files and comparing the results.
My results:
I reduced my total file size by about 20%. Unfortunately, a lot of my files were PDFs, which don't compress that well, but there was still some benefit. Not much luck (obviously) with file types that were already compressed.
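If you want to estimate the savings up front on your own data, the simplest approach is to run the compressor over a representative sample of the existing files and total up the before/after sizes, roughly like this sketch (Compress is the method from the question; how you load the sample is up to you):

using System;
using System.Collections.Generic;

// Rough sketch: measure total compression savings over a sample of files.
// "sampleFiles" is assumed to be file contents pulled from the existing table or from disk.
private static void ReportCompressionSavings(IEnumerable<byte[]> sampleFiles)
{
    long originalTotal = 0;
    long storedTotal = 0;
    int grewCount = 0;

    foreach (byte[] file in sampleFiles)
    {
        byte[] compressed = Compress(file);
        originalTotal += file.Length;

        // Count the compressed size only when it is actually smaller;
        // otherwise assume the original would be stored uncompressed.
        if (compressed.Length < file.Length)
        {
            storedTotal += compressed.Length;
        }
        else
        {
            storedTotal += file.Length;
            grewCount++;
        }
    }

    Console.WriteLine("Original total: {0:N0} bytes", originalTotal);
    Console.WriteLine("Stored total:   {0:N0} bytes", storedTotal);
    Console.WriteLine("Savings:        {0:P1}", 1.0 - (double)storedTotal / originalTotal);
    Console.WriteLine("Files that grew when compressed: {0}", grewCount);
}

Storing whichever version is smaller (plus a bit column flagging whether the row is compressed) also sidesteps the occasional case where the GZip output ends up larger than the input.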