Inspired by the discussion "Extracting text from pdf (java using pdfbox library) from a table's rows with different heights", I am able to read "normal" tables perfectly. Kudos to mkl.
The issue is that I cannot figure out how to read data from tables in which several cells are merged into one. I will continue brainstorming, but if somebody has an idea how to improve mkl's PdfBoxFinder class so that it can process tables with merged cells, I would appreciate it. I will definitely post the solution here if I find one myself. Thanks to all in advance.
I tried to find merged cells based on the text, but that is not very effective: there are too many types of tables to cover this way. I'm looking for a more generic solution. I will try checking the x positions of the text, but I'm not there yet. A demo is available on GitHub.
Example files are:
Merged cells https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.input/merged_cells_example.pdf
Regular tables https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.input/regular_table.pdf
How the code currently recognizes the tables is shown in the following files:
Merged cells https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.results/merged_cells_example.pdf-rectangles.pdf
Regular tables https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.results/regular_table.pdf-rectangles.pdf
Regular tables are recognized correctly, but the issue is with merged cells.
The document with a merged bottom row is recognized as a regular table: the bottom row has 3 cells but should have one.
Demo code:
package pl.pdob.pdfTables;

import mkl.testarea.pdfbox2.extract.PdfBoxFinder;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;

import java.awt.*;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
 * Class to demonstrate issue with tables with merged cells<br>
 * <a href="https://stackoverflow.com/questions/78001237/extracting-text-from-pdf-java-using-pdfbox-library-from-a-tables-with-merged-c">
 * Extracting text from pdf (java using pdfbox library) from a tables with merged cells
 * </a>
 * <br>
 * Method Drawing found rectangles taken from {@link mkl.testarea.pdfbox2.extract.ExtractBoxedText}<br>
 * and modified
 */
public class Stack78001237 {

    private static final File RESULT_FOLDER = new File("pl.pdob.results");
    private static final File INPUT_FOLDER = new File("pl.pdob.input");
    private static final String EXAMPLE_PDF = "regular_table.pdf";
    // private static final String EXAMPLE_PDF = "merged_cells_example.pdf";

    static {
        if (!INPUT_FOLDER.exists()) {
            //noinspection ResultOfMethodCallIgnored
            INPUT_FOLDER.mkdirs();
        }
        if (!RESULT_FOLDER.exists()) {
            //noinspection ResultOfMethodCallIgnored
            RESULT_FOLDER.mkdirs();
        }
    }

    public static void main(String[] args) throws IOException {
        Stack78001237 stack78001237 = new Stack78001237();
        stack78001237.drawBoxes(EXAMPLE_PDF);
    }

    @SuppressWarnings("SameParameterValue")
    private void drawBoxes(String fileName) throws IOException {
        File file = new File(INPUT_FOLDER, fileName);
        try (PDDocument document = PDDocument.load(file)) {
            for (PDPage page : document.getDocumentCatalog().getPages()) {
                // Collect the table's ruling lines and derive the cell boxes from them
                PdfBoxFinder boxFinder = new PdfBoxFinder(page);
                boxFinder.processPage(page);
                // Outline every detected cell in red on top of the existing page content
                try (PDPageContentStream canvas = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                    canvas.setStrokingColor(Color.RED);
                    for (Rectangle2D rectangle : boxFinder.getBoxes().values()) {
                        canvas.addRect((float) rectangle.getX(), (float) rectangle.getY(), (float) rectangle.getWidth(), (float) rectangle.getHeight());
                    }
                    canvas.stroke();
                }
            }
            document.save(new File(RESULT_FOLDER, fileName + "-rectangles.pdf"));
        }
    }
}
The issue is that PdfBoxFinder.java works perfectly, but only with regular tables.
I'm currently digging into how to solve it. If I knew the solution, I would not bother the Stack Overflow community with such a question.
I have solved this issue.
The approach is similar to the original solution and has the following steps:
The demo is in the files:
MainDemo.java
ExtractBoxedTextMergedCellsTest.java
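For readers who want the general idea without opening the repository, here is a minimal sketch of one possible way to detect merged cells. It is not the code from the demo files above and not part of PdfBoxFinder. It assumes the horizontal and vertical ruling lines of the table have already been collected, and that a cell boundary is only real if some vertical line at that x position actually spans the row. The names MergedCellSketch, VerticalLine and TOLERANCE are hypothetical and exist only for this illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class MergedCellSketch {

    /** A vertical ruling line at x spanning yMin..yMax (hypothetical type, not a PdfBoxFinder class). */
    static final class VerticalLine {
        final double x, yMin, yMax;

        VerticalLine(double x, double yMin, double yMax) {
            this.x = x;
            this.yMin = yMin;
            this.yMax = yMax;
        }
    }

    /** Assumed tolerance for coordinate comparisons; tune it for real documents. */
    static final double TOLERANCE = 0.5;

    /** True if a single vertical line at position x covers the whole row height. */
    static boolean separatorCoversRow(List<VerticalLine> lines, double x, double rowBottom, double rowTop) {
        for (VerticalLine line : lines) {
            if (Math.abs(line.x - x) <= TOLERANCE
                    && line.yMin <= rowBottom + TOLERANCE
                    && line.yMax >= rowTop - TOLERANCE) {
                return true;
            }
        }
        return false;
    }

    /**
     * Builds the cells of one row. xs holds all column boundaries of the table in ascending order.
     * A boundary that no vertical line crosses in this row is skipped, so the neighbouring cells merge.
     */
    static List<Rectangle2D> cellsInRow(double[] xs, double rowBottom, double rowTop, List<VerticalLine> lines) {
        List<Rectangle2D> cells = new ArrayList<>();
        double left = xs[0];
        for (int i = 1; i < xs.length; i++) {
            boolean lastBoundary = i == xs.length - 1;
            if (lastBoundary || separatorCoversRow(lines, xs[i], rowBottom, rowTop)) {
                cells.add(new Rectangle2D.Double(left, rowBottom, xs[i] - left, rowTop - rowBottom));
                left = xs[i];
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Three columns, two rows; the inner separators exist only in the top row (y 20..40).
        double[] xs = {0, 100, 200, 300};
        List<VerticalLine> lines = new ArrayList<>();
        lines.add(new VerticalLine(0, 0, 40));
        lines.add(new VerticalLine(100, 20, 40));
        lines.add(new VerticalLine(200, 20, 40));
        lines.add(new VerticalLine(300, 0, 40));

        System.out.println("top row cells: " + cellsInRow(xs, 20, 40, lines).size());
        System.out.println("bottom row cells: " + cellsInRow(xs, 0, 20, lines).size());
    }
}
```

In this toy layout the bottom row comes out as a single cell because no inner separator reaches it. The sketch only handles horizontally merged cells and expects each separator to be drawn as one segment per row or longer. Vertically merged cells would need the same check applied to the horizontal lines.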