As part of our production we produce fairly large xlsx files.
Around 1000 columns and 200k rows.
We have noticed that older versions of pandas/openpyxl are doing a much better job at being memory and time efficient in producing these files.
Using:
openpyxl = 3.0.7 and 3.1.2
pandas = 1.2.4
The runtime is roughly half an hour with an output file around 100mb and RAM usage around 4gb
Using
openpyxl = 3.1.4 and 3.1.2
pandas = 2.1.4
it takes 2 hours and the output file is 400mb and it takes up all available RAM at almost 16gb.
I'll do some more experimentation whether this is down to pandas or openpyxl but I was wondering if anyone knows whats going on here.
edit:
I've updated openpyxl to 3.1.2 and it still prints speedily in a small file size.
I've also downgraded to 3.1.2 with pandas 2.1.4 and it still takes ages, so the issue seems to be with pandas.
Turns out there is no performance difference. The difference came because in the old environment it is using xlsxwriter and in the new its using openpyxl.
I should have studied the pd.show_versions() output better (required by the pandas github)