I'm using the mstor library to parse an mbox mail file. Some of the files exceed a gigabyte in size. As you can imagine, this can cause some heap space issues.
A loop retrieves one message per iteration, and the getMessage() call is where heap allocation fails when memory runs out. If I add a call to System.gc() at the top of this loop, the program parses the large files without error, but I realize that collecting garbage 40,000 times has to be slowing the program down.
My first attempt was to call the collector only every 500 records, with if (i % 500 == 0) System.gc(). I tried raising and lowering this number, but the results are inconsistent and the run generally ends in an OutOfMemoryError.
My second, more clever attempt looks like this:
    try {
        message = inbox.getMessage(i);
    } catch (OutOfMemoryError e) {
        if (firstTry) {
            i--;
            firstTry = false;
        } else {
            firstTry = true;
            System.out.println("Message " + i + " skipped.");
        }
        System.gc();
        continue;
    }
The idea is to call the garbage collector only when an OutOfMemoryError is thrown, and then decrement the counter to try the same message again. Unfortunately, after parsing several thousand e-mails the program just starts outputting:
Message 7030 skipped.
Message 7031 skipped.
....
and so on for the rest of them.
I'm just confused as to how hitting the collector for each iteration would return different results than this. From my understanding, garbage is garbage, and all this should be changing is how much is collected at a given time.
Can anyone explain this odd behavior? Does anyone have recommendations for other ways to call the collector less frequently? My heap space is maxed out.
The mstor library wasn't handling the caching of messages well. After doing some research I found that if you call Folder.close() (inbox is my folder object above), mstor and javax.mail release all of the messages that were cached as a result of the getMessage() method.
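This also explains why System.gc() on its own did not help in the retry version: an object that is still referenced from a cache is strongly reachable, so no collection can reclaim it. Here is a minimal sketch of that effect using only the standard library. The static `cache` list is a hypothetical stand-in for whatever mstor keeps internally, not its actual data structure.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class CacheRetentionDemo {
    // Hypothetical stand-in for a library-internal message cache.
    public static final List<byte[]> cache = new ArrayList<>();

    /** Returns true if a cached object is still alive after a GC request. */
    public static boolean survivesWhileCached() {
        byte[] message = new byte[1024 * 1024];
        cache.add(message);
        WeakReference<byte[]> tracker = new WeakReference<>(message);
        message = null;   // drop our own reference; the cache still holds one
        System.gc();      // cannot reclaim the array: it is reachable via the cache
        boolean alive = tracker.get() != null;
        cache.clear();    // only now does the array become eligible for collection
        return alive;
    }

    public static void main(String[] args) {
        System.out.println("alive while cached: " + survivesWhileCached());
    }
}
```

Closing the folder plays the role of `cache.clear()` here: it drops the references, and only then can a collection free the memory.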
I made the try/catch block look like this:
    try {
        message = inbox.getMessage(i);
        // moved all of my calls to message.getFrom(),
        // message.getAllRecipients(), etc. inside this try/catch.
    } catch (OutOfMemoryError e) {
        if (firstTry) {
            i--;
            firstTry = false;
        } else {
            firstTry = true;
            System.out.println("Message " + i + " skipped.");
        }
        inbox.close(false);
        System.gc();
        inbox.open(Folder.READ_ONLY);
        continue;
    }
    firstTry = true;
Each time the catch statement is hit, it takes 40-50 ms to manually clear the cached messages and re-open the folder.
When calling the garbage collector on every iteration, it took 57 minutes to parse a 1.6 gigabyte file. With this logic, it takes only 18 minutes to parse the same file.
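For anyone who wants to try the retry-once control flow in isolation, here is a self-contained sketch. `MessageSource` and `FakeSource` are hypothetical stand-ins and not part of mstor or javax.mail: `reset()` plays the role of the close/reopen pair, and the fake throws a simulated OutOfMemoryError after a fixed number of reads. Indices are 0-based here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryOnceDemo {
    /** Hypothetical stand-in for the mail folder; not the javax.mail API. */
    public interface MessageSource {
        int count();
        String get(int index);   // may throw OutOfMemoryError
        void reset();            // stands in for close(false) + open(READ_ONLY)
    }

    /** Reads every message, retrying each index once after a reset; returns skipped indices. */
    public static List<Integer> readAll(MessageSource source, List<String> out) {
        List<Integer> skipped = new ArrayList<>();
        boolean firstTry = true;
        for (int i = 0; i < source.count(); i++) {
            try {
                out.add(source.get(i));
                firstTry = true;       // success: the next failure gets a fresh retry
            } catch (OutOfMemoryError e) {
                source.reset();        // release cached state before continuing
                if (firstTry) {
                    firstTry = false;
                    i--;               // the loop's i++ returns us to the same index
                } else {
                    firstTry = true;
                    skipped.add(i);    // second failure in a row: give up on this one
                }
            }
        }
        return skipped;
    }

    /** Fake source: fails once `capacity` reads are held, and always fails on index `bad`. */
    public static class FakeSource implements MessageSource {
        private final int total, capacity, bad;
        private int held = 0;

        public FakeSource(int total, int capacity, int bad) {
            this.total = total; this.capacity = capacity; this.bad = bad;
        }
        public int count() { return total; }
        public String get(int index) {
            if (index == bad || held >= capacity) {
                throw new OutOfMemoryError("simulated");
            }
            held++;
            return "msg " + index;
        }
        public void reset() { held = 0; }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        List<Integer> skipped = readAll(new FakeSource(10, 4, 7), out);
        System.out.println("read " + out.size() + ", skipped " + skipped);
    }
}
```

Resetting `firstTry` after every successful read is what keeps one earlier failure from turning every later failure into an immediate skip.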
Update - Another important factor in lowering the amount of memory used by mstor is its cache properties. Somebody else already mentioned setting "mstor.cache.disabled" to true, and this helped. Today I discovered another important property that greatly reduced the number of OOM catches for even larger files.
    Properties props = new Properties();
    props.setProperty("mstor.mbox.metadataStrategy", "none");
    props.setProperty("mstor.cache.disabled", "true");
    props.setProperty("mstor.mbox.cacheBuffers", "false"); // most important