Tags: java, csv, comparison, avro, apache-drill

What is the best way to compress a CSV file with many duplicates?


I'm dealing with data like the following: the first column is the trade id, the second column is the simulation id (heavily duplicated), the third column is a date (also heavily duplicated), and the fourth is the present value of the trade, which is mostly 0, but any non-zero value is fairly unique.

My question is: is there any way to compress this data down to 20% of its current size while still supporting lookups?

I have tried Avro; it saves about 40% of the storage and supports Apache Drill queries, but my boss expects a saving of around 80%.

41120634|1554|20150203|-509057.56
40998001|1554|20150203|0
40960705|1554|20150203|0
40998049|1554|20150203|0
41038826|1554|20150203|0
41081136|1554|20150203|-7198152.23
41120653|1554|20150203|-319.436349
41081091|1554|20150203|-4.28520907E+009
41120634|1536|20150227|-528555.02
41038808|1536|20150227|0
40998001|1536|20150227|0
41120634|1556|20150130|-528822.733
40960705|1536|20150227|0
40998049|1536|20150227|0
41038826|1536|20150227|0

Solution

  • Apache Drill supports the Parquet file format. Parquet is a columnar file format that supports columnar compression, which allows it to exploit repeated values in a column to save space. By comparison, Avro is a row-based file format, so it will not achieve as much compression as Parquet for repeated column values. These guys have reported 87% compression of their CSV data using Parquet. More information about how to use Parquet with Apache Drill is here; a minimal conversion sketch follows after the note below.

    Also, as a side note, the Drill team is working on improvements for Parquet which will probably go into the 1.13 release. I believe a 4x increase in read performance was achieved for Parquet files with the new improvements.
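
    As a rough starting point, here is a minimal sketch of converting the CSV above to Parquet with Drill's CREATE TABLE AS (CTAS) statement. The file path /data/trades.csv, the dfs.tmp workspace, the table name trades_parquet, and the column names are all assumptions for illustration, not something from the question:

    -- Write CTAS output in Parquet format (this is Drill's default, set here explicitly).
    ALTER SESSION SET `store.format` = 'parquet';

    -- Drill exposes a headerless CSV row as a single `columns` array,
    -- so cast each field into a typed, named column before writing Parquet.
    CREATE TABLE dfs.tmp.`trades_parquet` AS
    SELECT CAST(columns[0] AS BIGINT) AS trade_id,
           CAST(columns[1] AS INT)    AS simulation_id,
           CAST(columns[2] AS INT)    AS value_date,
           CAST(columns[3] AS DOUBLE) AS present_value
    FROM dfs.`/data/trades.csv`;

    With typed columns, Parquet's dictionary and run-length encodings can collapse the heavily repeated simulation id and date values, and the result is still queryable from Drill, e.g. SELECT * FROM dfs.tmp.`trades_parquet` WHERE trade_id = 41120634.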