Tags: hive, compression, bigdata, zlib, orc

Why does distributing by some column increase storage size dramatically?


There is something weird with my program. I am responsible for processing traffic logs and storing the processed data in a table. The log is so big that we need to figure out a good way to reduce the storage cost. The table schema for the processed data has 30 columns: 14 of them describe the device and the app environment, and 8 of them are attributes of the point or page. Because our execution engine is Hive, we use ORC as the file format and zlib as the compression method. To make the table easy for BI/BA to use, we also distribute the data by point_name before writing it to file.

Here comes the problem: if we load the data without "distribute by xx", it takes 14.5497 GB of storage. When I distribute by point_name (the name of a point in the web or app), the storage doubles to 28.6083 GB. When I experiment with distributing by cuid (the unique id of the device), the storage size is 11.7286 GB, and when I distribute by point_name + cuid, the storage is 29.6391 GB. I am confused how "distribute by" can affect the storage so much. Can someone please explain this to me? Thanks a lot.
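Roughly, the load statements being compared look like the sketch below (the table names traffic_log_orc and traffic_log_staging are placeholders, not the real 30-column schema):

    -- Baseline: no DISTRIBUTE BY, ~14.5 GB in this case
    INSERT OVERWRITE TABLE traffic_log_orc
    SELECT * FROM traffic_log_staging;

    -- Shuffle rows to reducers by point_name, ~28.6 GB in this case
    INSERT OVERWRITE TABLE traffic_log_orc
    SELECT * FROM traffic_log_staging
    DISTRIBUTE BY point_name;

    -- Shuffle rows to reducers by cuid, ~11.7 GB in this case
    INSERT OVERWRITE TABLE traffic_log_orc
    SELECT * FROM traffic_log_staging
    DISTRIBUTE BY cuid;

    -- Shuffle rows to reducers by point_name + cuid, ~29.6 GB in this case
    INSERT OVERWRITE TABLE traffic_log_orc
    SELECT * FROM traffic_log_staging
    DISTRIBUTE BY point_name, cuid;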


Solution

  • distribute by triggers a reducer step: each reducer receives its own key(s) and produces its own file(s) containing only those keys, and this can affect compression efficiency dramatically. If the data sits in a single file, compression can be better than if it is spread across 10 files, and significantly better if the data is sorted, because similar data can be compressed much more effectively by entropy encoding.

    Your distribute by keys have some correlation with other columns, so the files produced by the final reduce vertex may contain data that is very similar or not similar at all, resulting in files with different entropy that can be compressed more or less efficiently.

    One of the algorithms used by ZLIB is Lempel–Ziv. Lempel–Ziv achieves compression by replacing repeated occurrences of data with references to a single copy of that data appearing earlier in the uncompressed stream. Identical values that Lempel–Ziv could compress together may end up in a single file and a single stripe (if the data was sorted during load), or they may end up in different files or stripes and cannot be compressed efficiently, because each file gets its own Lempel–Ziv compression. The same applies to Huffman coding (more frequent values are encoded with shorter codes, less frequent values with longer ones). The important point is that compression is not possible across different files: each file, and even each stripe in ORC, is compressed independently. Put similar data together (distribute, sort, or both) and compression will improve.

    Try to sort by some low-cardinality keys first and high-cardinality keys last, in addition to distribute by, and you will see how the compression improves; see the sketch below. Sorting can reduce the size of a compressed ORC file by a factor of 4 or even more.
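    As a minimal sketch (table names are placeholders from the question above, and device_type stands in for a hypothetical low-cardinality column in your schema), the combined distribute-and-sort load could look like this:

        -- Send all rows for a given point_name to the same reducer, then sort
        -- inside each reducer's output: low-cardinality keys first,
        -- high-cardinality keys last, so long runs of repeated values line up
        -- within each ORC stripe for the zlib/ORC encoders.
        INSERT OVERWRITE TABLE traffic_log_orc
        SELECT *
        FROM traffic_log_staging
        DISTRIBUTE BY point_name
        SORT BY point_name, device_type, cuid;

    Because SORT BY in Hive sorts within each reducer (not globally), it composes naturally with DISTRIBUTE BY and is enough to colocate similar values inside each output file.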