Search code examples
pythoncsvxlsxopenoffice-calc

Size difference between python generated csv and converted xlsx file


I wrote a python program to rename file names (around 500k files) in a specific format. To track the status, I have created an array of array status [[]] to which I am appending filename, file path, status (whether renaming is successful or not) and error (if any). Once the process is completed, exporting that array to a csv using pandas.

status_file = # path to csv file
status_file_stream = open(status_file, 'w')

df_status = pandas.DataFrame(status)
df_status.to_csv(status_file, index=False)

Generated CSV file size is 228 MB, but when I open it with openoffice application and save as xlsx then the new xlsx file size is only 15.6 MB without loosing any data.

Why there is a huge size difference between CSV and XSLX or is this is the issue with pandas to get large file size?


Solution

  • for completeness sake ill write my comment as an answer:

    compression! csv is just a fancy way to say this is a txt with commas, its a plain text file. xlsx on the other hand is a binary format and programs that generate that format compress the data, if your data is mostly the same, which I'm assuming it is because the paths are mostly the same, then its highly compressible, which is what you are seeing

    xlsx uses zip (at least according to google)

    to test this claim I have generated a ~110MB file containing random file names with the same path and used zlib to compress it and the output was ~20MB

    xlsx probably has additional optimizations but I did not dig into the format