Extracting text from pdf (java using pdfbox library) from tables with merged cells

Inspired by the discussion "Extracting text from pdf (java using pdfbox library) from a table's rows with different heights", I'm able to read "normal" tables perfectly. Kudos to mkl.

The issue is that I cannot figure out how to read data from tables where several cells are merged into one. I will continue brainstorming, but if somebody has an idea how to improve mkl's code in class PdfBoxFinder so that it can process tables with merged cells, I would appreciate it. I will post the solution here if I find one myself. Thanks to all in advance.

I tried to find merged cells based on their text, but that is not very effective: the approach has to handle too many types of tables. I'm looking for a more generic solution. I will try checking the x positions of the texts, but I'm not there yet. A demo is available on GitHub: Demo

Example files are:
Merged cells
Regular tables

The following files show how the code currently recognizes the tables:
Merged cells
Regular tables

Regular tables are recognized correctly; the issue is with merged cells.
A document with a merged bottom row: Source file for merged cells

It is recognized as a regular table: the bottom row comes out as 3 cells when it should be a single cell. Result file for merged cells

Demo code:

package pl.pdob.pdfTables;

import mkl.testarea.pdfbox2.extract.PdfBoxFinder;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;

import java.awt.*;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
 * Class to demonstrate issue with tables with merged cells<br>
 * <a href="">
 * Extracting text from pdf (java using pdfbox library) from a tables with merged cells
 * </a>
 * <br>
 * Method Drawing found rectangles taken from {@link mkl.testarea.pdfbox2.extract.ExtractBoxedText}<br>
 * and modified
 */
public class Stack78001237 {

    private static final File RESULT_FOLDER = new File("pl.pdob.results");

    private static final File INPUT_FOLDER = new File("pl.pdob.input");

    private static final String EXAMPLE_PDF = "regular_table.pdf";
//    private static final String EXAMPLE_PDF = "merged_cells_example.pdf";

    static {
        // Make sure the input and result folders exist
        // (the mkdirs() calls are an assumed completion of the truncated listing).
        if (!INPUT_FOLDER.exists()) {
            //noinspection ResultOfMethodCallIgnored
            INPUT_FOLDER.mkdirs();
        }

        if (!RESULT_FOLDER.exists()) {
            //noinspection ResultOfMethodCallIgnored
            RESULT_FOLDER.mkdirs();
        }
    }

    public static void main(String[] args) throws IOException {
        Stack78001237 stack78001237 = new Stack78001237();
        stack78001237.drawBoxes(EXAMPLE_PDF);
    }

    private void drawBoxes(String fileName) throws IOException {
        File file = new File(INPUT_FOLDER, fileName);

        try (PDDocument document = PDDocument.load(file)) {
            for (PDPage page : document.getDocumentCatalog().getPages()) {
                PdfBoxFinder boxFinder = new PdfBoxFinder(page);
                // Let the finder walk the page content and collect the border lines
                // (as in ExtractBoxedText, which this method is based on).
                boxFinder.processPage(page);

                try (PDPageContentStream canvas = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                    // Outline every box the finder detected.
                    canvas.setStrokingColor(Color.RED);
                    for (Rectangle2D rectangle : boxFinder.getBoxes().values()) {
                        canvas.addRect((float)rectangle.getX(), (float)rectangle.getY(), (float)rectangle.getWidth(), (float)rectangle.getHeight());
                    }
                    canvas.stroke();
                }
            }
            // Save the annotated copy next to the other results.
            document.save(new File(RESULT_FOLDER, fileName + "-rectangles.pdf"));
        }
    }
}

The issue is that the code works perfectly, but only with regular tables.
I'm currently digging into how to solve it. If I knew the solution, I would not bother the Stack Overflow community with such a question.


  • I have solved this issue.
    The approach is similar to the original solution and has the following steps:

    • read the thin rectangles that form the table borders
    • separate them into horizontal and vertical lines, remove duplicates and sort
    • merge adjoining lines so that, where possible, a single line runs from top to bottom or from left to right
    • read the boxes formed by the lines.
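
The steps above could be sketched roughly as follows. This is not the code from the demo files and does not use PdfBoxFinder: the class MergedCellGrid, its Segment type, the method names and the TOLERANCE value are all hypothetical, and the border rectangles are assumed to have been collected from the page already. The sketch only handles merged cells that are themselves rectangular.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; not part of PDFBox or of mkl's PdfBoxFinder. */
class MergedCellGrid {

    /** Assumed maximum border thickness and coordinate slack, in PDF user space units. */
    static final double TOLERANCE = 1.0;

    /** A border line: fixed coordinate pos (y if horizontal, x if vertical) spanning [from, to]. */
    static final class Segment {
        final double pos, from, to;
        Segment(double pos, double from, double to) { this.pos = pos; this.from = from; this.to = to; }
    }

    /** Steps 1-2: sort a thin rectangle into the horizontal or vertical list by its shape. */
    static void classify(Rectangle2D r, List<Segment> horizontal, List<Segment> vertical) {
        if (r.getHeight() <= TOLERANCE && r.getWidth() > TOLERANCE) {
            horizontal.add(new Segment(r.getCenterY(), r.getMinX(), r.getMaxX()));
        } else if (r.getWidth() <= TOLERANCE && r.getHeight() > TOLERANCE) {
            vertical.add(new Segment(r.getCenterX(), r.getMinY(), r.getMaxY()));
        }
    }

    /** Steps 2-3: sort, then join segments on the same line that touch or overlap (this also drops duplicates). */
    static List<Segment> mergeCollinear(List<Segment> segments) {
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingDouble((Segment s) -> s.pos).thenComparingDouble(s -> s.from));
        List<Segment> merged = new ArrayList<>();
        for (Segment s : sorted) {
            Segment last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && Math.abs(last.pos - s.pos) <= TOLERANCE && s.from <= last.to + TOLERANCE) {
                merged.set(merged.size() - 1,
                        new Segment(last.pos, Math.min(last.from, s.from), Math.max(last.to, s.to)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }

    /** True if some line at coordinate pos spans the whole range [from, to]. */
    static boolean covers(List<Segment> lines, double pos, double from, double to) {
        for (Segment s : lines) {
            if (Math.abs(s.pos - pos) <= TOLERANCE && s.from <= from + TOLERANCE && s.to >= to - TOLERANCE) {
                return true;
            }
        }
        return false;
    }

    /** Sorted line coordinates, with values closer than TOLERANCE collapsed into one. */
    static double[] positions(List<Segment> lines) {
        List<Double> distinct = new ArrayList<>();
        for (double p : lines.stream().mapToDouble(s -> s.pos).sorted().toArray()) {
            if (distinct.isEmpty() || p - distinct.get(distinct.size() - 1) > TOLERANCE) distinct.add(p);
        }
        return distinct.stream().mapToDouble(Double::doubleValue).toArray();
    }

    /** Step 4: grow each cell right and up until a border line actually closes it. */
    static List<Rectangle2D> findCells(List<Segment> horizontalRaw, List<Segment> verticalRaw) {
        List<Segment> horizontal = mergeCollinear(horizontalRaw);
        List<Segment> vertical = mergeCollinear(verticalRaw);
        double[] xs = positions(vertical);
        double[] ys = positions(horizontal);
        List<Rectangle2D> cells = new ArrayList<>();
        if (xs.length < 2 || ys.length < 2) return cells;
        boolean[][] used = new boolean[xs.length - 1][ys.length - 1];
        for (int j = 0; j < ys.length - 1; j++) {
            for (int i = 0; i < xs.length - 1; i++) {
                if (used[i][j]) continue;
                int right = i + 1;   // skip vertical lines that do not cross this row
                while (right < xs.length - 1 && !covers(vertical, xs[right], ys[j], ys[j + 1])) right++;
                int top = j + 1;     // skip horizontal lines that do not span this cell
                while (top < ys.length - 1 && !covers(horizontal, ys[top], xs[i], xs[right])) top++;
                for (int a = i; a < right; a++) for (int b = j; b < top; b++) used[a][b] = true;
                cells.add(new Rectangle2D.Double(xs[i], ys[j], xs[right] - xs[i], ys[top] - ys[j]));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Thin border rectangles {x, y, width, height}: 3 columns, 2 rows, bottom row merged.
        double[][] borders = {
            {0, -0.25, 300, 0.5}, {0, 99.75, 300, 0.5},
            {0, 49.75, 100, 0.5}, {100, 49.75, 100, 0.5}, {200, 49.75, 100, 0.5},
            {-0.25, 0, 0.5, 100}, {299.75, 0, 0.5, 100},
            {99.75, 50, 0.5, 50}, {199.75, 50, 0.5, 50}
        };
        List<Segment> horizontal = new ArrayList<>();
        List<Segment> vertical = new ArrayList<>();
        for (double[] b : borders) classify(new Rectangle2D.Double(b[0], b[1], b[2], b[3]), horizontal, vertical);
        for (Rectangle2D cell : findCells(horizontal, vertical)) System.out.println(cell);
    }
}
```

The point of the design is that a vertical line only splits a row if it actually spans that row's height, so the two inner column lines in the sample do not cut the merged bottom row. In a real page the rectangles passed to classify would come from whatever collects the border shapes.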

    Demo is in files: