
How to use iText to parse paths (such as lines in the document)


I am using iText to parse text in a PDF document, using PdfContentStreamProcessor with a RenderListener, for example:

  PdfReader reader = new PdfReader(file.toURI().toURL());
  int numberOfPages = reader.getNumberOfPages();
  MyRenderListener listener = new MyRenderListener();
  PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
  for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
      PdfDictionary pageDic = reader.getPageN(pageNumber);
      PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
      Rectangle pageSize = reader.getPageSize(pageNumber);
      listener.startPage(pageNumber, pageSize);
      processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resourcesDic);
  }

I have no problem getting the text with the renderText(TextRenderInfo) method, but how do I parse the graphic content apart from images? For example, in my case I would like to get:

  • Text content which is in a box
  • Horizontal lines

Solution

  • Per mkl's comment, by using ExtRenderListener instead of RenderListener I am able to get the geometries. I used the question "How to extract the color of a rectangle in a PDF, with iText" for reference.
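To make the idea concrete, here is a minimal sketch of the bookkeeping such a listener might do once it receives path operations. It is deliberately independent of iText so it runs on its own. PathGeometryCollector, Segment, Box and all method names are hypothetical names invented for this illustration, not part of iText, and the sketch assumes the coordinates have already been transformed into page space.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not an iText class): records straight segments and rectangles. */
final class PathGeometryCollector {

    /** A straight line segment between two points. */
    static final class Segment {
        final float x1, y1, x2, y2;

        Segment(float x1, float y1, float x2, float y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        /** True when both ends share (almost) the same y and the segment has some length. */
        boolean isHorizontal(float tolerance) {
            return Math.abs(y1 - y2) <= tolerance && Math.abs(x1 - x2) > tolerance;
        }
    }

    /** An axis-aligned rectangle, normalized so that width and height are non-negative. */
    static final class Box {
        final float x, y, width, height;

        Box(float x, float y, float width, float height) {
            // PDF rectangles may carry a negative width or height, so normalize here.
            this.x = Math.min(x, x + width);
            this.y = Math.min(y, y + height);
            this.width = Math.abs(width);
            this.height = Math.abs(height);
        }

        /** True when the point lies inside the box or on its border. */
        boolean contains(float px, float py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private final List<Box> boxes = new ArrayList<>();
    private float currentX;
    private float currentY;
    private boolean hasCurrentPoint;

    /** Starts a new subpath at the given point. */
    void moveTo(float x, float y) {
        currentX = x;
        currentY = y;
        hasCurrentPoint = true;
    }

    /** Records a segment from the current point to (x, y) and moves the current point there. */
    void lineTo(float x, float y) {
        if (hasCurrentPoint) {
            segments.add(new Segment(currentX, currentY, x, y));
        }
        moveTo(x, y);
    }

    /** Records a rectangle given as lower-left corner plus width and height. */
    void rectangle(float x, float y, float width, float height) {
        boxes.add(new Box(x, y, width, height));
    }

    /** Returns the recorded segments that are horizontal within the given tolerance. */
    List<Segment> horizontalLines(float tolerance) {
        List<Segment> result = new ArrayList<>();
        for (Segment s : segments) {
            if (s.isHorizontal(tolerance)) {
                result.add(s);
            }
        }
        return result;
    }

    /** Returns a read-only view of the recorded rectangles. */
    List<Box> boxes() {
        return Collections.unmodifiableList(boxes);
    }
}
```

As far as I can tell from iText 5.5.x, the natural place to feed this collector is ExtRenderListener.modifyPath(PathConstructionRenderInfo): switch on getOperation() (MOVETO, LINETO, RECT) and read the coordinates from getSegmentData(), applying the matrix from getCtm() first. Text that sits "in a box" could then be found by testing each TextRenderInfo position with Box.contains. Note that many PDFs draw rules as very thin filled rectangles rather than stroked lines, so it is probably worth checking the boxes for near-zero heights as well.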