java pdf pdfbox data-extraction boxable

Extracting exact table data from PDF


I am trying to extract each row of a table from a PDF file that I created earlier.

The problem is that empty cells, which I expected to be stored as 'null', are skipped entirely. They are not even read as space characters.

[Image: extracted from PDF]

I extract the content from my PDF via this method:

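    // Package names below assume PDFBox 2.x
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public final ArrayList<String> extractLines(final File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper strip = new PDFTextStripper();
            String txt = strip.getText(doc);
            // Split on \n or \r\n so no stray carriage returns remain
            String[] arr = txt.split("\\r?\\n");
            return new ArrayList<>(Arrays.asList(arr));
        }
    }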

Is it even possible to extract the data with the whitespace preserved?

If so, can it be done with PDFBox, or does it need a different method?

EDIT:

I cannot get traprange to work. A simple test:

    File e = new File("C:/Users/Test/Downloads/a.pdf");

    List<Table> t = new PDFTableExtractor().setSource(e).extract();
    System.out.println(t.get(0).toString());

This only gives me:

[Image: traprange output]

Could it be related to the layout of my table?

My table:

[Image: the table]


Solution

  • I came up with my own solution.

    My table is stored as a 2D ArrayList, so each inner list holds one row of the table.

    For each row, I save the position of the non-empty cell (only one cell per row is non-empty at any time).

    I store these positions in a metadata field of the PDF and read that field back to restore the positions.
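
The idea above could be sketched roughly as follows. This is only an illustration, not the exact code used: the class name `CellPositions`, the methods `encode` and `decode`, and the comma-separated format are all hypothetical, and it assumes that an empty cell is an empty string or null and that each row has at most one non-empty cell.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of PDFBox or boxable.
class CellPositions {

    // For each row, record the column index of its single non-empty cell
    // (-1 if the whole row is empty), joined into one string such as "1,0,-1".
    static String encode(List<List<String>> table) {
        return table.stream()
                .map(row -> {
                    for (int col = 0; col < row.size(); col++) {
                        String cell = row.get(col);
                        if (cell != null && !cell.isEmpty()) {
                            return String.valueOf(col);
                        }
                    }
                    return "-1";
                })
                .collect(Collectors.joining(","));
    }

    // Rebuild the rows from the encoded positions and the values that the
    // text extraction returned (one value per non-empty row, in row order).
    // Assumption: every stored index is smaller than the column count.
    static List<List<String>> decode(String positions, List<String> values, int columns) {
        List<List<String>> table = new ArrayList<>();
        if (positions.isEmpty()) {
            return table; // no rows were encoded
        }
        int next = 0;
        for (String p : positions.split(",")) {
            int col = Integer.parseInt(p);
            // Start with a row of empty cells, then fill the recorded column.
            List<String> row = new ArrayList<>(Collections.nCopies(columns, ""));
            if (col >= 0) {
                row.set(col, values.get(next++));
            }
            table.add(row);
        }
        return table;
    }
}
```

The encoded string is what would go into the metadata field. With PDFBox, one way to do that is probably `doc.getDocumentInformation().setCustomMetadataValue(key, encoded)` when writing the PDF and `getCustomMetadataValue(key)` when reading it, where `key` is a field name of your choice.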