Tags: spring-boot, hibernate, jackson, spring-data

Identifying and resolving memory bloat in a large database retrieval


I have a database table containing almost 4 million rows with approximately 10 columns, and I need to retrieve all of it. For business/political reasons I would prefer to avoid pagination, and the calling code cannot be responsible for deciding the number of rows to retrieve.

I need to write the retrieved data to an OutputStream tied to an HTTP response, and deserialize it in another application. Currently, I serialize the retrieved data as JSON before writing it to the OutputStream.

The application is built on Spring Boot; I interact with the database through Spring Data JPA (backed by Hibernate), and I use Jackson to serialize the database rows. The entities are well-behaved in the sense that the ORM never joins other tables to my large, 4-million-row table.
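For reference, the repository method is assumed to be a plain derived streaming query; a minimal sketch (the entity name comes from the code below, the ID type is a guess) looks like this:

public interface TableEntryRepository extends JpaRepository<TableEntry, Long> {

    // Derived query returning a Stream so rows can be consumed one at a time
    // instead of being materialized into a List.
    Stream<TableEntry> streamAllBy();
}

Spring Data only streams results while a transaction is open, so the method consuming the Stream is assumed to run with something like @Transactional(readOnly = true).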

My current implementation:

try (Stream<TableEntry> tableEntryStream = tableEntryRepository.streamAllBy();
        SequenceWriter outputStreamSequenceWriter = jsonMapper.writerFor(MappedTableEntry.class)
                .writeValues(request.getOutputStream());
        Session hibernateSession = entityManager.unwrap(Session.class)) {
    Iterator<TableEntry> tableEntryIterator = tableEntryStream.iterator();
    log.debug("Opened database connection.");

    for (int i = 0; ; i++) {
        if (!tableEntryIterator.hasNext()) {
            break;
        }
        TableEntry nextTableEntry = tableEntryIterator.next();

        // I identified the hibernate cache and entityManager as possible memory hogs
        // so I attempt to detach the entity from the entityManager and 
        // evict it from Hibernate L1 cache, allowing the object to be garbage collected.
        hibernateSession.evict(nextTableEntry);
        this.entityManager.detach(nextTableEntry);

        if (i % 1000 == 0) {
            log.debug("Writing element n={} to OutputStream using jsonGenerator.", i);
        }
        MappedTableEntry nextMappedTableEntry = tableEntryMapper.convertTableEntryToMappedTableEntry(nextTableEntry);
        outputStreamSequenceWriter.write(nextMappedTableEntry);
        if (i % 1000 == 0) {
            log.debug("Finished pushing element n={} to OutputStream.", i);
        }
    }        
} catch (IOException ioException) {
    throw new UncheckedIOException(ioException);
}

When I start retrieving the database rows, the application's memory footprint starts growing and eventually consumes some 3-4 gigabytes of RAM, slowing the application to a crawl and crashing my Docker container (which is limited to 3 gigabytes).

I am not a profiling expert, but when I profile the application, the bulk of the memory appears to be held by strings, byte arrays, TableEntry objects and EntityDeleteAction objects:

[Partial profiling snapshot]

Since I have no need to keep the 4 million entries in memory and I am pushing the objects to the OutputStream as directly as possible, I would expect this to be a CPU problem, not a memory problem.

EDIT: Following M. Deinum's suggestions seems to have done the trick. I ended up with the following:

try (Stream<TableEntry> tableEntryStream = tableEntryRepository.streamAllBy();
        OutputStream oos = new ObjectOutputStream(request.getOutputStream());
        SequenceWriter outputStreamSequenceWriter = jsonMapper.writerFor(MappedTableEntry.class)
                .writeValues(oos)) {
    log.debug("Opened database connection.");
    final AtomicInteger counter = new AtomicInteger();
    final AtomicInteger errorCounter = new AtomicInteger();
    tableEntryStream.map(tableEntryMapper::convertTableEntryToMappedTableEntry).forEach(mappedTableEntry -> {
        try {
            outputStreamSequenceWriter.write(mappedTableEntry);

            int count = counter.incrementAndGet();
            if (count % CLEAR_ENTITY_MANAGER_THRESHOLD == 0) {
                log.debug("Clearing entityManager after {} writes to OutputStream.", count);
                outputStreamSequenceWriter.flush();
                entityManager.clear();
            }
            errorCounter.set(0);
        } catch (IOException ex) {
            errorCounter.incrementAndGet();
            if (errorCounter.get() > ERROR_THRESHOLD) {
                throw new UncheckedIOException(ex);
            } else {
                log.warn(
                        "Encountered IOException {}; processing will abort if more than {} consecutive exceptions occur.",
                        ex.getMessage(), ERROR_THRESHOLD, ex);
            }
        }
    });
} catch (IOException ioException) {
    throw new UncheckedIOException(ioException);
}

There's still some readability wonkiness, but it seems to work and stays within my memory limits.
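The two thresholds referenced above are plain constants whose values are not shown in the snippet; purely illustrative declarations (the numbers are assumptions, not values from the original code) would be:

// Illustrative values only; not taken from the original code.
private static final int CLEAR_ENTITY_MANAGER_THRESHOLD = 1_000;
private static final int ERROR_THRESHOLD = 5;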


Solution

try (Stream<TableEntry> tableEntryStream = tableEntryRepository.streamAllBy();
        SequenceWriter writer = jsonMapper.writerFor(MappedTableEntry.class)
                .writeValues(request.getOutputStream())) {
    log.debug("Opened database connection.");
    final AtomicInteger counter = new AtomicInteger();
    int entries = tableEntryStream
            .map(nextTableEntry -> tableEntryMapper.convertTableEntryToMappedTableEntry(nextTableEntry))
            .mapToInt(mappedEntry -> write(mappedEntry, writer, counter))
            .sum();
} catch (IOException ioException) {
    throw new UncheckedIOException(ioException);
}

public int write(MappedTableEntry entry, SequenceWriter writer, AtomicInteger counter) {
    try {
        writer.write(entry);
        int count = counter.incrementAndGet();
        if (count % 1000 == 0) {
            // Flush the writer and clear the persistence context every 1000 entities
            // so detached rows can be garbage collected.
            writer.flush();
            this.entityManager.clear();
        }
    } catch (IOException ex) {
        throw new UncheckedIOException(ex);
    }
    return 1;
}
    

Something like this. It will clear the EntityManager every 1000 writes. We need a terminal operation, hence the mapToInt and the sum. There might be a better way; you could, for example, write a quick Collector that does this (see the sketch after this paragraph).
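For reference, such a Collector could look roughly like the sketch below. It is not part of the original answer: it reuses the writer, the flush-every-1000 policy and the entityManager from the code above and builds on java.util.stream.Collector.of.

// Sketch of a Collector replacing the mapToInt/sum trick: it writes each entry,
// clears the EntityManager every 1000 elements and yields the total count.
Collector<MappedTableEntry, AtomicInteger, Integer> writingCollector = Collector.of(
        AtomicInteger::new,
        (acc, entry) -> {
            try {
                writer.write(entry);
                if (acc.incrementAndGet() % 1000 == 0) {
                    writer.flush();
                    entityManager.clear();
                }
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        },
        (left, right) -> { left.addAndGet(right.get()); return left; },
        AtomicInteger::get);

int entries = tableEntryStream
        .map(tableEntryMapper::convertTableEntryToMappedTableEntry)
        .collect(writingCollector);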

Another thing, from a performance perspective: you might want to specify a fetch size. The default is 10, which means that after every 10 entries Hibernate has to go back to the database to fetch the next batch. If you set it to, say, 1000, you reduce the number of network round trips; one way to do that is a query hint on the repository method, as sketched below.
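As an illustration (not from the original answer), the Hibernate fetch-size hint can be attached to the Spring Data streaming method with @QueryHints. The repository and entity names come from the question; the ID type and the value of 1000 are assumptions, and Spring Boot 2.x / javax.persistence is assumed (newer versions use jakarta.persistence):

import java.util.stream.Stream;

import javax.persistence.QueryHint;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.QueryHints;

public interface TableEntryRepository extends JpaRepository<TableEntry, Long> {

    // "org.hibernate.fetchSize" is Hibernate's JDBC fetch-size hint;
    // 1000 is an example value, not a recommendation from the answer.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "1000"))
    Stream<TableEntry> streamAllBy();
}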