I was able to extract the tables using Tabula. I looked for a way to also output the text between the tables, but Tabula seems to handle tables only. Any idea how to do this?
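import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Rectangle;
import technology.tabula.Table;
import technology.tabula.detectors.NurminenDetectionAlgorithm;
import technology.tabula.extractors.BasicExtractionAlgorithm;
import technology.tabula.extractors.ExtractionAlgorithm;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

public static List<Table> extractTablesFromPDF(PDDocument document) {
    NurminenDetectionAlgorithm detectionAlgorithm = new NurminenDetectionAlgorithm();
    SpreadsheetExtractionAlgorithm spreadsheetExtractor = new SpreadsheetExtractionAlgorithm();
    ObjectExtractor objectExtractor = new ObjectExtractor(document);
    PageIterator pages = objectExtractor.extract();
    List<Table> tables = new ArrayList<Table>();
    while (pages.hasNext()) {
        Page page = pages.next();
        // Use the lattice-based extractor when the page has ruling lines, otherwise the stream-based one.
        ExtractionAlgorithm algExtractor = spreadsheetExtractor.isTabular(page)
                ? spreadsheetExtractor
                : new BasicExtractionAlgorithm();
        // Detect candidate table areas, then extract each one separately.
        List<Rectangle> tablesOnPage = detectionAlgorithm.detect(page);
        for (Rectangle guessRect : tablesOnPage) {
            Page guess = page.getArea(guessRect);
            tables.addAll((List<Table>) algExtractor.extract(guess));
        }
    }
    return tables;
}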
Thank you in advance for your help!
Maintainer of Tabula here.
There are no public methods in Tabula to do so, but you can resort to PDFBox's PDFTextStripper.
Looking at one of the command line tools included with PDFBox might be useful: https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java
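If you only want the text that lies between the detected tables, one possible approach is to compute the page areas that the table rectangles do not cover and hand each of those areas to PDFBox's PDFTextStripperByArea (addRegion, extractRegions, getTextForRegion). Below is a minimal sketch of the geometry part only, using nothing but the JDK. The class name TextBands and the method between are hypothetical, not part of Tabula or PDFBox. It assumes a top-left origin with y growing downwards, and it assumes that the rectangles returned by detect() can be passed in as java.awt.geom.Rectangle2D values in that same coordinate system, which you should verify on your own documents.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; not part of Tabula or PDFBox. */
final class TextBands {

    private TextBands() {
    }

    /**
     * Returns the full-width horizontal bands of a page that no table covers.
     * Assumes a top-left origin with y growing downwards.
     */
    static List<Rectangle2D> between(double pageWidth, double pageHeight,
                                     List<? extends Rectangle2D> tables) {
        // Process the tables from the top of the page to the bottom.
        List<Rectangle2D> sorted = new ArrayList<>(tables);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinY));

        List<Rectangle2D> bands = new ArrayList<>();
        double cursor = 0; // lowest y already accounted for
        for (Rectangle2D table : sorted) {
            if (table.getMinY() > cursor) {
                // Gap between the previous table (or page top) and this table.
                bands.add(new Rectangle2D.Double(0, cursor, pageWidth, table.getMinY() - cursor));
            }
            // Math.max keeps the cursor correct when tables overlap vertically.
            cursor = Math.max(cursor, table.getMaxY());
        }
        if (cursor < pageHeight) {
            // Remaining area below the last table.
            bands.add(new Rectangle2D.Double(0, cursor, pageWidth, pageHeight - cursor));
        }
        return bands;
    }
}
```

Each returned band could then be registered with a PDFTextStripperByArea under a name of your choice via addRegion, followed by extractRegions on the PDPage and getTextForRegion for each name. Note that this treats tables as spanning the full page width, so text sitting beside a narrow table would be skipped.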