I am trying to extract each row of a table from a PDF file that I created earlier.
The problem is that empty cells, which I expected to be stored as 'null', are skipped entirely; they are not even read as space characters.
I extract the content from the PDF with this method:
public final ArrayList<String> extractLines(final File pdf) throws IOException {
    // PDDocument is Closeable, so try-with-resources closes it for us.
    try (PDDocument doc = PDDocument.load(pdf)) {
        PDFTextStripper strip = new PDFTextStripper();
        String txt = strip.getText(doc);
        // PDFTextStripper uses the platform line separator by default,
        // so split on \r\n as well as \n.
        String[] arr = txt.split("\\r?\\n");
        return new ArrayList<>(Arrays.asList(arr));
    }
}
Is it even possible to extract the data so that empty cells are preserved as whitespace?
If so, can PDFBox do it, or do I need a different tool?
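One idea I have been considering, as a rough sketch only: since the extracted text carries no marker for empty cells, a row could be rebuilt from the horizontal position of each piece of text. The code below is plain Java and does not call PDFBox; `Fragment`, `ColumnBuckets`, and the `columnStarts` values are hypothetical names and numbers for illustration. I assume the x-coordinates would come from something like `TextPosition.getXDirAdj()` in a `PDFTextStripper` subclass that overrides `writeString(String, List<TextPosition>)`, and that the left edge of every column is known in advance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ColumnBuckets {

    /** A piece of text and the x-coordinate where it starts (hypothetical holder type). */
    static final class Fragment {
        final String text;
        final float x;

        Fragment(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Places each fragment into the last column whose left edge is at or
     * before the fragment's x-coordinate. Columns that receive no fragment
     * stay as empty strings. columnStarts must be sorted ascending
     * (assumption: the column edges are known up front).
     */
    static List<String> toRow(List<Fragment> fragments, float[] columnStarts) {
        List<String> row = new ArrayList<>(Collections.nCopies(columnStarts.length, ""));
        for (Fragment f : fragments) {
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            // Join with a space if two fragments land in the same column.
            String existing = row.get(col);
            row.set(col, existing.isEmpty() ? f.text : existing + " " + f.text);
        }
        return row;
    }
}
```

With four columns, a row containing text only in the second and fourth column would come back as a four-element list with empty strings in the first and third slot, which is the behaviour I was hoping to get from the plain text extraction.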
EDIT:
I cannot get traprange to work. A simple test:
File e = new File("C:/Users/Test/Downloads/a.pdf");
List<Table> t = new PDFTableExtractor().setSource(e).extract();
System.out.println(t.get(0).toString());
This only gives me:
Could this be related to the layout of my table?
My table:
I came up with my own solution.
My data is a 2D ArrayList in which each inner list holds one row of the table.
I now save the position of the non-empty cell in each row (only one cell per row is ever non-empty).
I store these positions in a metadata field of the PDF and read that field back to restore them.
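A minimal sketch of the idea, not my exact code: the column index of the non-empty cell in each row is written into one comma-separated string and later used to rebuild the rows. The class and method names here are made up for illustration, and the string format is an assumption. The encoded string can then be stored in the PDF, for example with PDFBox's `PDDocumentInformation.setCustomMetadataValue(String, String)`, and read back with `getCustomMetadataValue(String)`. The sketch assumes the extracted text yields exactly one value per row.

```java
import java.util.ArrayList;
import java.util.List;

final class CellPositions {

    /**
     * Encodes, for every row, the column index of its first non-empty cell
     * as a comma-separated string (-1 if the row has no non-empty cell).
     */
    static String encode(List<List<String>> table) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < table.size(); r++) {
            List<String> row = table.get(r);
            int filled = -1;
            for (int c = 0; c < row.size(); c++) {
                String cell = row.get(c);
                if (cell != null && !cell.isEmpty()) {
                    filled = c;
                    break;
                }
            }
            if (r > 0) {
                sb.append(',');
            }
            sb.append(filled);
        }
        return sb.toString();
    }

    /**
     * Rebuilds the table: values.get(r) is placed at the stored column of
     * row r and every other cell becomes an empty string.
     * Assumption: values holds exactly one entry per encoded row.
     */
    static List<List<String>> decode(String encoded, List<String> values, int columns) {
        List<List<String>> table = new ArrayList<>();
        if (encoded.isEmpty()) {
            return table;
        }
        String[] parts = encoded.split(",");
        for (int r = 0; r < parts.length; r++) {
            int filled = Integer.parseInt(parts[r]);
            List<String> row = new ArrayList<>();
            for (int c = 0; c < columns; c++) {
                row.add(c == filled ? values.get(r) : "");
            }
            table.add(row);
        }
        return table;
    }
}
```

When reading the PDF back, the lines returned by extractLines serve as `values`, and the stored string tells me which column each of them belongs to.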