Search code examples
javapdfitextpdfboxitextpdf

How can we extract text content from PDF file without header and footer


How can we extract text content from PDF file, we are using pdfbox to extract text from PDF file but we are getting header and footer is not required. I am using following java code.

PDFTextStripper stripper = null;
  try {
    stripper = new PDFTextStripper();
   } catch (Exception e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
   }
     stripper.setStartPage(pageCount);
     stripper.setEndPage(pageCount);
   try {
      String pageText = stripper.getText(document);
       System.out.println(pageText);  
    } catch (Exception e) {
     // TODO Auto-generated catch block
     e.printStackTrace();
 }

Solution

  • You have tagged this as an itext/itextpdf question, yet you are using PdfBox. That's confusing.

    You also claim that your PDF file has headers and footers. This would imply that your PDF is a Tagged PDF and that the header and the footer are marked as artifacts. If that is the case, than you should take advantage of the Tagged nature of the PDF, and extract the PDF as is done in the ParseTaggedPdf example:

    TaggedPdfReaderTool readertool = new TaggedPdfReaderTool();
    PdfReader reader = new PdfReader(StructuredContent.RESULT);
    readertool.convertToXml(reader, new FileOutputStream(RESULT));
    reader.close();
    

    If this doesn't result in anything, you clearly don't have a Tagged PDF in which case there are no headers and footers in your document from a technical point of view. You may see headers and footers with your human eyes, but that doesn't mean that a machine sees these headers and footers. To a machine, it's just text like any other text in the page.

    The ExtractPageContentArea example shows how we can define a rectangle that excludes the header and the footer when parsing for the content.

    PdfReader reader = new PdfReader(pdf);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
    }
    out.flush();
    out.close();
    reader.close();
    

    In this case, we have examined the document manually and we noticed that the actual text is always added inside the rectangle new Rectangle(70, 80, 490, 580). The header is added above Y coordinate 580 and below coordinate 80. By using the RegionTextRenderFilter we can extract the content excluding the content that doesn't overlap with the rectangle we have defined.