Search code examples
javaamazon-web-servicesamazon-s3apache-poiexport-to-excel

How to generate an Excel file in-memory and stream it out?


I haven't found a way to do the following yet:

  1. Read a very large data set
  2. Write the data to an Excel file (avoid JVM out-of-memory, but ideally don't want to write the whole thing to disk)
  3. Upload it in parts (to S3 in my case)

Hoping to do all those piece by piece (read some, generate some, write some, repeat). But haven't figured out if I can both avoid memory problems and avoid writing to disk. Something like this could maybe work? :

  1. Incrementally read the data using pagination
  2. Apache POI Streaming looks like a good way to generate it in memory
  3. Use AWS S3 multipart upload to transfer it incrementally

But (2) seems to be a problem:

Apache POI streaming keeps a limited number of rows in memory which is great, but the end result is still flushing the workbook to file.

Has anyone done something like this before? (And is it even possible given that Excel is a binary format a collection of zipped XML files?)


Solution

  • Efficiently streaming large documents as they are constructed is a common problem. SXSSFWorkbook provides a good framework using a sliding window model. As a practical matter, finding the optimal approach will require profiling in a particular context.

    For reference below, I have updated the example cited to allow adjusting the window size. A variation of this example, used on the command line, can be used to asses the output. The result appears to work seamlessly on any network mounted volume, but the result may depend on the details of this vendor's offering. The experimental DeferredSXSSFWorkbook may be relevant going forward; @PJ Fanning elaborates here.

    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.util.CellReference;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;
    import org.junit.Assert;
    
    /**
     * @see https://stackoverflow.com/q/78456843/230513
     */
    public class StreamTest {
    
        private static final int N = 1000;
        private static final int W = 10;
    
        public static void main(String[] args) throws IOException {
            // keep W rows in memory, N - W rows will be flushed to disk
            var wb = new SXSSFWorkbook(W);
            var sh = wb.createSheet();
            for (int rownum = 0; rownum < N; rownum++) {
                Row row = sh.createRow(rownum);
                for (int cellnum = 0; cellnum < 10; cellnum++) {
                    Cell cell = row.createCell(cellnum);
                    String address = new CellReference(cell).formatAsString();
                    cell.setCellValue(address);
                }
            }
            // Verify that W rows before N - W are flushed and inaccessible
            for (int rownum = N - (2 * W); rownum < N - W; rownum++) {
                Assert.assertNull(sh.getRow(rownum));
            }
            // The last W rows are still in memory
            for (int rownum = N - W; rownum < N; rownum++) {
                Assert.assertNotNull(sh.getRow(rownum));
            }
            var file = new File("sxssf.xlsx");
            try (FileOutputStream out = new FileOutputStream(file)) {
                wb.write(out);
            } finally {
                // dispose of temporary files backing this workbook on disk
                wb.dispose();
            }
        }
    }