I have a batch of PDF files that I need to split into multiple PDFs, one for each page. I've written a script that does this, and it works for a single file or a handful of files. But when I process many PDFs (about 6000 pages in total), it eventually runs out of memory and prints "Warning: You did not close a PDF Document". I got it to work by lowering the memory setting to MemoryUsageSetting.setupMixed(50_000_000), but if I set it to 200 MB it always runs out. I'm convinced there is an internal COSDocument opened by pdfbox that I can't close, and that's why I'm seeing the warning. I know I'm closing every PDDocument that I create. In fact, I added code to check each instance I have access to, and none of those log statements ever show up. Here is my code:
    int splitPdf( File pdfFile ) {
        String filePrefix = prefix
        if( !filePrefix ) filePrefix = pdfFile.name[0..<(pdfFile.name.lastIndexOf("."))]
        logger.info("Splitting ${pdfFile.name} pages ${startAtPage} - ${endAtPage} every ${splitPagesEvery} pages")
        PDDocument document = PDDocument.load( pdfFile, MemoryUsageSetting.setupTempFileOnly() )
        try {
            Splitter splitter = new Splitter()
            splitter.setStartPage(startAtPage)
            splitter.setEndPage(endAtPage)
            splitter.setSplitAtPage(splitPagesEvery)
            splitter.memoryUsageSetting = MemoryUsageSetting.setupMixed(50_000_000)
            int page = startAtPage > 0 ? startAtPage : 1
            splitter.split(document).each { PDDocument doc ->
                String filename = "${filePrefix}-${page}.pdf"
                try {
                    if( extractions ) {
                        PDFTextStripper stripper = new PDFTextStripper()
                        String pageText = stripper.getText( doc )
                        Map<String,String> results = extractions.collectEntries([filename: filename]) { name, spec ->
                            [ name, spec.call(pageText) ]
                        }
                        this.manifest.write(results)
                    }
                    doc.save( new File( destDir, filename) )
                    page++
                } finally {
                    doc.close()
                    if( !doc.document.isClosed() ) logger.info("${filename} is NOT closed!")
                }
            }
            return page - startAtPage
        } finally {
            document.close()
            if( !document.document.isClosed() ) logger.info("${pdfFile.name} is NOT closed!")
        }
    }
I don't have proof of a memory leak in pdfbox 2.0.29, but I can't otherwise explain why I see so many warnings about not closing a PDF document. I'm working on another script that I can run in a profiler to see whether the number of COSDocument instances is much higher than the number I know should exist.
My question: is there any way an extra COSDocument could be created inside pdfbox, while splitting pages, that is never freed?
OK, I think I found the answer. The warning only shows up after the OutOfMemoryError is thrown. Because Splitter pre-allocates a PDDocument for every page and collects them in a List, you need enough memory to hold the full source file along with all of the split pages at once. I had a document with 2009 pages, which triggered the OOME partway through, and at that point all of the existing page documents were placed on the finalizer queue. The warning was printed by the instances that Splitter had allocated but that my loop had not yet visited, and therefore had not closed. The fix, as described above, was to lower the in-memory cache size in MemoryUsageSetting so that all of the pages could be processed. As Dagget pointed out, all of those pages will eventually be freed, but the warning message makes it seem like the OOME was caused by NOT calling close, which isn't true. You just don't have enough memory. ¯\_(ツ)_/¯
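For anyone hitting the same limit, another option besides lowering the cache size might be to stop Splitter from pre-allocating every page at once, by calling it on a small window of pages at a time. The following is only a rough sketch under a few assumptions: the names splitInBatches, pageWindows and batchSize are made up for illustration, the text extraction step from my script is left out, and it assumes the number of split documents open at the same time is what drives the memory use. It only relies on the setStartPage / setEndPage / setSplitAtPage calls I was already using, and it closes every document in a batch in a finally block, so nothing is left for the finalizer if a save fails partway through.

```groovy
@Grab('org.apache.pdfbox:pdfbox:2.0.29')
import org.apache.pdfbox.io.MemoryUsageSetting
import org.apache.pdfbox.multipdf.Splitter
import org.apache.pdfbox.pdmodel.PDDocument

// Hypothetical helper: 1-based, inclusive [start, end] page windows
// containing at most batchSize pages each.
List<List<Integer>> pageWindows(int totalPages, int batchSize) {
    if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive")
    List<List<Integer>> windows = []
    for (int start = 1; start <= totalPages; start += batchSize) {
        windows << [start, Math.min(start + batchSize - 1, totalPages)]
    }
    return windows
}

// Hypothetical replacement for splitPdf: writes one PDF per page, but only
// ever holds batchSize split documents open at the same time.
int splitInBatches(File pdfFile, File destDir, int batchSize = 50) {
    String name = pdfFile.name
    String filePrefix = name.contains('.') ? name[0..<name.lastIndexOf('.')] : name
    int written = 0
    PDDocument document = PDDocument.load(pdfFile, MemoryUsageSetting.setupTempFileOnly())
    try {
        for (List<Integer> window : pageWindows(document.numberOfPages, batchSize)) {
            // A fresh Splitter restricted to this window of pages
            Splitter splitter = new Splitter()
            splitter.setStartPage(window[0])
            splitter.setEndPage(window[1])
            splitter.setSplitAtPage(1)
            splitter.setMemoryUsageSetting(MemoryUsageSetting.setupMixed(50_000_000))
            List<PDDocument> pages = splitter.split(document)
            try {
                pages.eachWithIndex { PDDocument doc, int i ->
                    String filename = "${filePrefix}-${window[0] + i}.pdf"
                    doc.save(new File(destDir, filename))
                    written++
                }
            } finally {
                // Close every document in the batch, including ones not yet saved
                pages.each { PDDocument doc -> doc.close() }
            }
        }
    } finally {
        document.close()
    }
    return written
}
```

With something like splitInBatches(new File("big.pdf"), new File("out"), 50), a 2009 page file would be handled as 41 windows, so at most 50 page documents exist at any moment instead of 2009. I haven't profiled this, so treat the batch size as a knob to experiment with rather than a recommended value.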