ORC and Parquet files already have a compression effect on their own (without additional compression options like Snappy): the same data loaded into a Parquet file is much smaller than in a text file. So my question is whether I need to specify a compression option like Snappy to further compress ORC and Parquet files, since these files are stored as binary and the compression effect may not be that big on binary data.
Update:
I tried with a text file that is 306M; the resulting sizes are:
text: 306M
parquet: 323M
parquet + snappy: 50M
From this test it looks like Parquet itself doesn't compress; it is even larger than the text file (I don't know the reason yet), while Parquet + Snappy compresses very well.
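A minimal sketch of one way to reproduce such a size comparison outside of Hive, using pyarrow (assuming pyarrow is installed; the CSV filename is hypothetical and exact numbers will vary with the data):

```python
import os

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the source text file and write it twice: once without compression,
# once with Snappy, then compare the on-disk sizes.
table = pacsv.read_csv("data.csv")

pq.write_table(table, "data_plain.parquet", compression="none")
pq.write_table(table, "data_snappy.parquet", compression="snappy")

for path in ("data.csv", "data_plain.parquet", "data_snappy.parquet"):
    size_mib = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mib:.1f} MiB")
```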
The compression efficiency of Parquet and ORC depends greatly on your data. Even without compression, Parquet uses encodings to shrink the data. Encodings use a simpler approach than general-purpose compression and, for homogeneous data, often yield results similar to universal compression. The most commonly used encoding in Parquet is dictionary encoding: each unique value is stored once in a dictionary, and each row stores only the index of its value in that dictionary. When a column has many non-unique entries, this removes the duplication of values, but it also adds the overhead of storing an additional integer per row. Parquet uses the smallest possible integer type for these indices, yet if a column contains only unique values, the overall storage for that column will be larger than it would be without the indices. In that case, you should simply disable dictionary encoding.
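As a sketch of how this can be controlled with pyarrow (the table and column names here are made up): `pyarrow.parquet.write_table` takes a `use_dictionary` argument that can be a boolean or a list of column names that should be dictionary-encoded.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pydict({
    "user_id": [f"id-{i}" for i in range(1000)],  # all values unique
    "country": ["DE", "US"] * 500,                # only 2 distinct values
})

# Dictionary-encode only the repetitive column; the all-unique column
# is written plain, avoiding the extra per-row index overhead.
pq.write_table(table, "mixed.parquet", use_dictionary=["country"])
```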
When a column has many repeating values, dictionary-encoding-then-compressing normally yields better results than compressing alone. Consider a string column with only 2 unique values of 16 bytes each, but 1024 rows. If we passed the plain values to the compressor, we would be compressing 16 KiB at once.
With dictionary encoding, we would instead have a dictionary of 32 bytes plus 1024 one-bit index values, i.e. 128 bytes of indices. Dictionary encoding alone thus already reduces the data to 160 bytes. Compressing data that is roughly two orders of magnitude smaller is always faster, independent of the entropy.
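A small sketch reproducing this example with pyarrow (assuming pyarrow is installed; the absolute file sizes include Parquet metadata overhead, so only the relative difference is meaningful):

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# 1024 rows, but only 2 distinct 16-byte string values.
values = ["a" * 16, "b" * 16] * 512
table = pa.Table.from_pydict({"col": values})

# Write the same column with and without dictionary encoding,
# with compression disabled, and compare the file sizes.
pq.write_table(table, "plain.parquet", use_dictionary=False, compression="none")
pq.write_table(table, "dict.parquet", use_dictionary=True, compression="none")

for path in ("plain.parquet", "dict.parquet"):
    print(f"{path}: {os.path.getsize(path)} bytes")
```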
For most real-world data, the combined encoding + compression efficiency normally falls somewhere between these two cases.