Tags: compression, big-data, gzip, avro, parquet

What’s the difference between data storage format and compression format?


I know that in the Big Data world, data storage file formats such as Parquet, Avro and others are widely used. I know that these formats are meant to improve performance, compatibility, schema evolution, compression and more. I want to focus on compression and understand why exactly these formats use compression formats like gzip, zlib and Snappy behind the scenes.

And this leads me to my main question: what's the difference between keeping my data in gzip format and keeping it in Parquet? Why do compression formats belong to a different category, rather than just being other options for data storage formats?


Solution

  • Compression (e.g. GZIP) and (structural) encoding (the Parquet format) of the data are two different techniques that can be combined. In practice, you essentially always do both.

    Compression only takes a binary stream of data and applies its algorithm to shrink it. It doesn't care about the actual information stored in the stream of bytes.

    For your data to be stored in a binary stream, you need to think of a binary representation for it. Since you are already looking at Parquet, I would assume that you have tabular data. Common encodings for tabular data are CSV and Parquet. Compression is independent of the encoding: you can apply it afterwards to either representation to make the on-disk storage smaller. In the case of the Parquet format, though, compression (incl. GZIP) is already built into the format, where it can be applied more efficiently than compressing the whole binary stream at once (see the sketch below).
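
    To make the distinction concrete, here is a minimal sketch in Python (assuming pandas with pyarrow is installed; the file names and data are made up) that stores the same table once as a gzip-compressed CSV and once as Parquet with its built-in compression:

        import gzip
        import pandas as pd

        # Hypothetical tabular data (made-up column names and values).
        df = pd.DataFrame({
            "user_id": range(100_000),
            "country": ["DE", "US", "FR", "NL"] * 25_000,
            "score": [x * 0.5 for x in range(100_000)],
        })

        # Option 1: encode as CSV, then compress the whole binary stream with GZIP.
        # The compressor only sees bytes; it knows nothing about rows or columns.
        csv_bytes = df.to_csv(index=False).encode("utf-8")
        with open("data.csv.gz", "wb") as f:
            f.write(gzip.compress(csv_bytes))

        # Option 2: encode as Parquet. Compression is applied inside the format,
        # per column chunk, so similar values sit next to each other and compress
        # well, and single columns can be read without decompressing everything.
        df.to_parquet("data.parquet", compression="gzip")  # or "snappy"

    Both files end up compressed on disk, but only in the Parquet case does the format itself decide where and how compression is applied, which is what makes it more efficient than compressing the whole stream at once.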