Search code examples
javaapache-poixssfjxlsjett

Does POI XSSF still have crazy bad memory issues?


A couple years ago, I ran into issues where I was creating large excel files using jXLS and POI XSSF. If my memory is correct, I think XSSF would create something like 1GB+ temp files on the disk to create 10mb excel files. So I stopped using jXLS and instead used SXSSF to create the excel files, but today I have new reasons to use jXLS or JETT.

Both jXLS and JETT websites seem to allude that performance is much better, but POI's XSSF website still says generically that the XSSF requires a higher memory footprint. I am wondering if this higher memory footprint is something like a reasonable 10% overhead these days, or if it is still like the 10,000% overhead as it was a couple years ago.

Are the crazy bad memory issues fixed with POI 3.9 XSSF? Should I not worry about using it with jXLS or JETT? Or are there certain gotchas to avoid? I am careful about reusing cell styles.


Solution

  • To answer your question, yes, POI will always use very large amount of memory when working on large XLSX files, which is much larger than the size of the XLSX files. I don't think this will change anytime soon, and there are pretty obvious reasons for that: XLSX is basically a bunch of zipped XML files, and XML is very well compressed (around 10x). Getting this XML just to sit in memory uncompressed would already increase the memory consumption tenfold, so if you add all the overhead of data structures, there's no way you should expect a 10% increase in memory consumption over the XLSX file size.

    Now, the good news is that as mentioned in the comments, Apache POI introduced SXSSF for streaming very large amount of data in a spreadsheet with very good performance and low memory usage. XLSX files generated this way are still streamed on the hard disk where they can end up taking quite a bit of space, but at least you don't risk OOME when writing hundreds of thousands of rows.

    The problem for you is that you won't be able to get JETT to directly work with SXSSF, as it needs the whole document loaded in memory for performing template filling. JETT author quickly discussed this topic here.

    I had the same problem, and ended up doing a two-step XLSX creation:

    1. A standard JETT XLSX template to generate headers and formatting. The last row of the first sheet contains cells with $$tokens$$, one per cell. I don't use JETT to insert the large amount of rows.

    2. Once JETT did its work, I reopen the workbook, read then delete the $$tokens$$ on the last line of the first spreadsheet, and start streaming data with SXSSF row by row.

    Of course, there are limitations to this approach: - You cannot use JETT on any of the streamed rows during rows insertion (but you can before, to dynamically pick the order of the $$tokens$$ for example) - Cells format won't be copied unless you take care of it yourself with POI API. I personally prefer to format whole columns in my XLSX file, and it will apply to the streamed data.

    This also works if you want to show charts using data inserted with SXSSF: You can define a Named Range with functions OFFSET and COUNTA, then create a Pivot table & Pivot Chart that will be refreshed when the XLSX is opened in Excel.