
Size of compressed files on disk increases massively after I sort?


I have a pandas DataFrame that I store on disk as a gzip-compressed Parquet file. In RAM it's around 90 GB, and when I save it with pandas.to_parquet using gzip compression, it compresses to around 3 GB.

I recently sorted it with pandas.sort_values on a different column, and all of a sudden the size on disk, saved with the same method, is 60 GB.

Why is that happening, and is there a different method of sorting or saving that prevents this?


Solution

  • I'd have to guess that your file was previously sorted on a different column, and that matches between each record's value in that column and the values of immediately preceding records were an important part of the compression. When you then sorted on a new column, that original column was effectively randomized, so similar values were no longer near each other. The column you sorted on will likely compress better than before, but that gain is small compared to what was lost on the originally sorted column. That's what killed the compression.