I am reading a PDF file (Skia/PDF m118 Google Docs Renderer) with PDFBox, but it reads nothing. The document has only one page and does not contain images. I am trying to read the content with PDFTextStripper.
Any idea how to read a Skia/PDF m118 Google Docs PDF with PDFBox?
I can open it with Acrobat Reader.
Code snippet
Document dataDocument = new Document();
try {
    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    pdfTextStripper.setParagraphStart("/t");
    pdfTextStripper.setSortByPosition(true);
    for (int i = 0; i < document.getNumberOfPages(); i++) {
        pdfTextStripper.setStartPage(i);
        pdfTextStripper.setEndPage(i);
        for (String line : pdfTextStripper.getText(document).split(pdfTextStripper.getParagraphStart())) {
            if (!line.isBlank() && line.length() > 3) {
                dataDocument.getText().add(line);
            }
        }
        dataDocument.getText().add(":page=" + i);
    }
    ...
PDFBox version
implementation 'org.apache.pdfbox:pdfbox:2.0.30'
PDF file
Here is a link to the document: https://jmp.sh/s/vTpGFHq6nLjzWXzBfxIA
Zlaja
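Both setStartPage() and setEndPage() require the parameter to be 1-based (see javadoc). Your loop starts at i = 0, so for a one-page document the stripper is only ever asked for page 0, which does not exist, and no text comes back. Thus change your code to: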
pdfTextStripper.setStartPage(i + 1);
pdfTextStripper.setEndPage(i + 1);
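An alternative to adding 1 in two places is to let the loop itself count pages from 1. The following is only a sketch of that idea, written without a PDFBox dependency so it can be run on its own: PageTextCollector and its collect method are hypothetical names, and the textOfPage function stands in for whatever returns the text of one page. With PDFBox 2.x that would presumably be a small wrapper that calls setStartPage(page), setEndPage(page) and getText(document) and handles the IOException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of PDFBox.
final class PageTextCollector {

    // Collects the non-trivial lines of every page.
    // textOfPage receives a 1-based page number, matching what
    // setStartPage()/setEndPage() expect.
    static List<String> collect(int pageCount,
                                IntFunction<String> textOfPage,
                                String paragraphStart) {
        List<String> result = new ArrayList<>();
        // Quote the marker so it is split on literally, not as a regex.
        String separator = Pattern.quote(paragraphStart);
        for (int page = 1; page <= pageCount; page++) { // 1-based, inclusive
            String pageText = textOfPage.apply(page);
            for (String line : pageText.split(separator)) {
                // Same filter as in the question: skip blank and very short lines.
                if (!line.isBlank() && line.length() > 3) {
                    result.add(line);
                }
            }
            // Keep the question's 0-based page marker.
            result.add(":page=" + (page - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a one-page document: only page 1 has text.
        IntFunction<String> fakePages =
                page -> page == 1 ? "/tHello world/tok/tSecond paragraph" : "";
        System.out.println(collect(1, fakePages, "/t"));
    }
}
```

Here the fake source returns text only for page 1, which mirrors the situation in the question: a source that is asked for page 0 has nothing to return.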