I have a Java class Split that is responsible for splitting PDF files into multiple parts based on page ranges. The class uses PDFBox for this purpose. Additionally, I have a PDFModel class to manage the resulting PDF files and a Range class to specify page ranges.
Here's the Split class:
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageTree;

public class Split {
    private Logger logger;
    private File inputFile;
    private PDFModel pdfModel;
    private File outputDirectory;

    public Split(Logger logger, File inputFile, File outputDirectory) {
        // Constructor logic...
    }

    /**
     * Splits a PDF file based on a list of page ranges and saves the resulting partial PDFs.
     *
     * @param ranges A list of page ranges specifying which pages to split from the input PDF.
     * @return An ArrayList of PDFModel objects representing the resulting partial PDFs.
     */
    public ArrayList<PDFModel> splitByRanges(ArrayList<Range> ranges) {
        ArrayList<PDFModel> results = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) {
            PDDocument partial = split(ranges.get(i));
            // Skip ranges that split() rejected as invalid
            if (partial == null) {
                continue;
            }
            File outputFile = new File(Paths.get(outputDirectory.getAbsolutePath(), "file_" + i + ".pdf").toString());
            try {
                partial.save(outputFile);
                results.add(new PDFModel(outputFile, partial));
                logger.info(this, "Successfully split '" + inputFile + "' from page " + ranges.get(i).getFrom() + " to " + ranges.get(i).getTo() + " into '" + outputFile.getAbsolutePath() + "'");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return results;
    }

    private PDDocument split(Range range) {
        PDDocument result = new PDDocument();
        int fromPage = range.getFrom();
        int toPage = range.getTo();
        // Get the PDPageTree from the PDDocument
        PDPageTree pdPageTree = pdfModel.getPDDocument().getPages();
        // Page numbers are 1-based and inclusive
        if (fromPage <= 0 || toPage <= 0 || fromPage > toPage || toPage > pdPageTree.getCount()) {
            logger.warning(this, "Invalid page range for splitting.");
            return null;
        }
        // PDPageTree.get() is 0-based, hence the -1
        for (int i = fromPage - 1; i < toPage; i++) {
            result.addPage(pdPageTree.get(i));
        }
        return result;
    }
}
The org.apache.pdfbox.multipdf.Splitter does the same but doesn't work either.
private PDDocument split(Range range) {
    int fromPage = range.getFrom();
    int toPage = range.getTo();
    PDDocument pddocument = pdfModel.getPDDocument();
    Splitter splitter = new Splitter();
    splitter.setStartPage(fromPage);
    splitter.setEndPage(toPage);
    // Chunk size equals the range length, so the whole range lands in one document
    splitter.setSplitAtPage(toPage - fromPage + 1);
    try {
        List<PDDocument> lst = splitter.split(pddocument);
        return lst.isEmpty() ? null : lst.get(0);
    } catch (IOException e) {
        e.printStackTrace();
        // Return null instead of dereferencing a list that was never assigned
        return null;
    }
}
The PDFModel class:
public class PDFModel {
    private File file;
    private PDDocument pdDocument;
    private ArrayList<PDFImage> images;
    private ArrayList<String> pages;

    public PDFModel(File file, PDDocument pdDocument) {
        // Constructor logic...
    }
}
The Range class:
public class Range {
    private int from;
    private int to;

    public Range(int from, int to) {
        // Constructor logic...
    }
}
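The constructor body and getters are elided above. For anyone who wants something to run, here is a minimal sketch of what a 1-based, inclusive page range with the same bounds check as split(Range) could look like. The names PageRange and isValidFor are hypothetical and are not part of my actual classes.

```java
// Hypothetical stand-in for the Range class; an illustration, not the original code.
final class PageRange {
    private final int from; // first page, 1-based, inclusive
    private final int to;   // last page, 1-based, inclusive

    PageRange(int from, int to) {
        this.from = from;
        this.to = to;
    }

    int getFrom() {
        return from;
    }

    int getTo() {
        return to;
    }

    // Negation of the "invalid range" condition used in split(Range):
    // both bounds positive, not reversed, and not past the last page.
    boolean isValidFor(int pageCount) {
        return from > 0 && to > 0 && from <= to && to <= pageCount;
    }
}
```

With this sketch, new PageRange(8, 9).isValidFor(12) is true, while the same range against an 8-page document is rejected.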
I'm trying to use this Split class to split a PDF file into multiple parts. The following call throws an error:
Split splitter = new Split(logger, inputFile, outputDirectory);
splitter.splitByRanges(new ArrayList<Range>(Arrays.asList(new Range(1, 7), new Range(8, 9), new Range(10, 11))));
And this works perfectly fine (but not with org.apache.pdfbox.multipdf.Splitter):
Split splitter = new Split(logger, inputFile, outputDirectory);
splitter.splitByRanges(new ArrayList<Range>(Arrays.asList(new Range(1, 8), new Range(10, 12), new Range(14, 16))));
The failing call produces the following StackOverflowError:
Exception in thread "main" java.lang.StackOverflowError
at java.base/java.util.HashMap.tableSizeFor(HashMap.java:378)
at java.base/java.util.HashMap.<init>(HashMap.java:455)
at java.base/java.util.LinkedHashMap.<init>(LinkedHashMap.java:439)
at java.base/java.util.HashSet.<init>(HashSet.java:171)
at java.base/java.util.LinkedHashSet.<init>(LinkedHashSet.java:167)
at org.apache.pdfbox.util.SmallMap.entrySet(SmallMap.java:384)
at org.apache.pdfbox.cos.COSDictionary.entrySet(COSDictionary.java:1232)
at org.apache.pdfbox.pdfwriter.compress.COSWriterObjectStream.writeCOSDictionary(COSWriterObjectStream.java:338)
at org.apache.pdfbox.pdfwriter.compress.COSWriterObjectStream.writeObject(COSWriterObjectStream.java:232)
at org.apache.pdfbox.pdfwriter.compress.COSWriterObjectStream.writeCOSDictionary(COSWriterObjectStream.java:343)
at org.apache.pdfbox.pdfwriter.compress.COSWriterObjectStream.writeObject(COSWriterObjectStream.java:232)
at org.apache.pdfbox.pdfwriter.compress.COSWriterObjectStream.writeCOSArray(COSWriterObjectStream.java:321)
at org.apache.pdfbox.pdfwriter.compress.COSWriterObjectStream.writeObject(COSWriterObjectStream.java:228)
How can I resolve this StackOverflow error?
The problem seems to be on PDFBox's side, so here is just a workaround for version 3.0.0:
private PDDocument split(Range range) {
    // Add every page of the source to a new document first
    PDDocument pdDocument = new PDDocument();
    for (PDPage pdPage : pdfModel.getPDDocument().getPages()) {
        pdDocument.addPage(pdPage);
    }
    int fromPage = range.getFrom();
    int toPage = range.getTo();
    int pageCount = pdDocument.getNumberOfPages();
    // Reject ranges that are non-positive, reversed, or past the last page
    if (fromPage <= 0 || toPage <= 0 || fromPage > toPage || toPage > pageCount) {
        logger.warning(this, "Invalid page range for splitting.");
        return null;
    }
    System.out.println("Page count: " + pageCount);
    // Remove the pages after the range, back to front so indices stay valid
    for (int n = pageCount - 1; n >= toPage; n--) {
        pdDocument.removePage(n);
    }
    // Remove the pages before the range, again back to front
    for (int n = fromPage - 2; n >= 0; n--) {
        pdDocument.removePage(n);
    }
    return pdDocument;
}
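To make the index arithmetic of the two removal loops easier to check, here is a small sketch that models the pages as plain integers in a list instead of a PDDocument. The class RemovalOrderDemo and the method keepRange are made up for illustration; only the loop bounds are taken from the workaround.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a list of page numbers stands in for the PDF document.
final class RemovalOrderDemo {

    // Keeps pages fromPage..toPage (1-based, inclusive) by removing everything else.
    static List<Integer> keepRange(int pageCount, int fromPage, int toPage) {
        List<Integer> pages = new ArrayList<>();
        for (int p = 1; p <= pageCount; p++) {
            pages.add(p);
        }
        // Trailing pages first, back to front, so the remaining indices do not shift.
        for (int n = pageCount - 1; n >= toPage; n--) {
            pages.remove(n); // int argument, so this removes by index
        }
        // Then the leading pages; index fromPage - 2 is the page just before the range.
        for (int n = fromPage - 2; n >= 0; n--) {
            pages.remove(n);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(keepRange(12, 8, 9)); // prints [8, 9]
    }
}
```

Removing from the back in both loops is what keeps the indices stable; removing front to back would shift every later page down by one after each removal.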
Alternatively, use an org.apache.pdfbox version later than 3.0.0 (3.0.1 or above); hopefully the bug is resolved there.