
xlsxwriter - Why do shorter strings occupy the same size as strings twice as long?


I'm writing data into an xlsx file with xlsxwriter. There are columns business_unit, creator_login_sap, etc., with 130,000 records in the file in total. business_unit and creator_login_sap are strings: business_unit has a constant length of 4 characters, while creator_login_sap has an average length of 10 characters.

import xlsxwriter
import io

output = io.BytesIO()
wb = xlsxwriter.Workbook(output)
ws = wb.add_worksheet()

columns = ['business_unit', 'creator_login_sap', ...]
data = [('BU01', 'ALNUDOVAN00'), ...]

# Header row
for col_idx, column in enumerate(columns):
    ws.write(0, col_idx, column)

# Data rows start at row 1, below the header
for row_idx, row in enumerate(data, 1):
    for col_idx, value in enumerate(row):
        ws.write(row_idx, col_idx, value)

wb.close()  # close() finalizes the workbook into the buffer

When I was trying to reduce the file size I noticed that the business_unit and creator_login_sap columns weigh almost the same (~450 KB each). This fact confused me.

Why does this happen? Is there a way to make shorter strings occupy less space?


Solution

  • The data is already compressed. An xlsx file is a ZIP package containing XML files. 130K rows in 450KB is less than 4 bytes per row; a plain-text file with the same data would be about 1.8MB. That's impressive compression, down to 25% of the original size.
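
    Since an xlsx file is just a ZIP archive, you can verify this yourself with the standard zipfile module, which lists the parts of the package and their compressed sizes. A minimal sketch, assuming the workbook was saved to report.xlsx (a placeholder name, not from the question):

    import zipfile

    # Each part of the xlsx package, with raw and compressed sizes
    with zipfile.ZipFile('report.xlsx') as zf:
        for info in zf.infolist():
            ratio = info.compress_size / info.file_size if info.file_size else 0
            print(f'{info.filename}: {info.file_size} -> {info.compress_size} bytes ({ratio:.0%})')

    Parts such as xl/sharedStrings.xml and xl/worksheets/sheet1.xml typically dominate the total.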

    That said, it may be possible to reduce the size even further. You can test this by opening the file in Excel and saving it to a different file. If the result is smaller, there's room for improvement. Excel, the application, uses shared strings extensively to keep files as small as possible: instead of storing possibly repetitive strings in the cells themselves, it stores them once in a shared string resource and stores a reference to the shared value in each cell.

    xlsxwriter already uses shared strings to reduce the size. Some other libraries don't, which results in bigger files.
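
    To see how much the shared string table helps, you can count how many distinct values each column actually contains; a value that repeats is stored once and referenced by index everywhere else. A quick sketch, reusing the columns and data structures from the question:

    # Few distinct values per column = big shared-string savings
    for col_idx, column in enumerate(columns):
        distinct = len({row[col_idx] for row in data})
        print(f'{column}: {distinct} distinct values in {len(data)} rows')

    With 130K rows, a 4-character business_unit likely has only a handful of distinct values, so most of its on-disk weight is per-cell markup and references rather than string data, which helps explain why it can weigh about the same as the longer column.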

    If you want to reduce the amount of RAM used, at the expense of a larger file, you can use constant_memory mode. This is explained in Working with Memory and Performance. This mode uses less memory by flushing each row as it is written and by not using shared strings. The other restriction is that it doesn't allow you to modify data after it's written, which also limits some formatting operations.

    wb = xlsxwriter.Workbook(output, {'constant_memory': True})
    ...

    From the docs:

    The trade-off when using 'constant_memory' mode is that you won’t be able to take advantage of any new features that manipulate cell data after it is written. Currently the add_table() method doesn’t work in this mode and merge_range() and set_row() only work for the current row.
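
    Putting it together, here is a minimal 'constant_memory' sketch reusing the columns and data from the question (report.xlsx is a placeholder filename). Because rows are flushed as the worksheet progresses, they must be written in sequential row order:

    import xlsxwriter

    wb = xlsxwriter.Workbook('report.xlsx', {'constant_memory': True})
    ws = wb.add_worksheet()

    # Header row first...
    for col_idx, column in enumerate(columns):
        ws.write(0, col_idx, column)

    # ...then data rows, strictly top to bottom; a flushed row can't be revisited
    for row_idx, row in enumerate(data, 1):
        for col_idx, value in enumerate(row):
            ws.write(row_idx, col_idx, value)

    wb.close()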

    Please don't "optimize" without reason

    I'm currently dealing with files containing 2K rows and 1M empty cells. Somehow, somewhere, someone tried to "optimize" something or other and ended up producing a 5MB file that Pandas has to process fully even though there's almost no data. Resaving such a file with Excel produces a 50KB file.

    So think of the consumers of that file before rushing to "optimize" anything.