
How to find table border lines in pdf using PDFBox?


I am trying to find table border lines in a PDF. I used the PrintTextLocations class of PDFBox to assemble words. Now I want to find the coordinates of the lines that form the table. I tried using org.apache.pdfbox.pdfviewer.PageDrawer, but I cannot find any character or graphic containing those lines. I tried two ways:

First:

Graphics g = null;
Dimension d = new Dimension();
d.setSize(700, 700);
PageDrawer pageDrawer = new PageDrawer();
pageDrawer.drawPage(g, myPage, d);

It gave me a NullPointerException. So secondly, I tried to override the processStream function, but I am unable to get any stroke. Kindly help me out. I am open to using any other library that gives me the coordinates of the lines in the table. And another quick question: what kind of objects are those table border lines in PDFBox? Are they graphics or characters?

Here is the link to the sample PDF I am trying to parse: http://stats.bls.gov/news.release/pdf/empsit.pdf. I am trying to get the table lines on page 8.

Edit: I ran into another problem. While parsing page 1 of this PDF, I am unable to get any lines because the pathIterator in the printPath() function is empty, although the strokePath() function is called for each line. How can I work with this PDF?


Solution

  • In the 1.8.* versions, the PDFBox parsing capabilities were implemented in a not very generic way. In particular, the OperatorProcessor implementations were tightly coupled to specific parser classes; e.g. the implementations dealing with path drawing operations assumed they were interacting with a PageDrawer instance.

    Thus, unless one wanted to copy & paste all those OperatorProcessor classes with minute changes, one had to derive from such a specific parser class.

    In your case, therefore, we will also derive our parser from PageDrawer; after all, we are interested in path drawing operations:

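    import java.awt.Image;
    import java.awt.geom.AffineTransform;
    import java.awt.geom.GeneralPath;
    import java.awt.geom.PathIterator;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.cos.COSStream;
    import org.apache.pdfbox.pdfviewer.PageDrawer;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.TextPosition;

    public class PrintPaths extends PageDrawer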
    {
        //
        // constructor
        //
        public PrintPaths() throws IOException
        {
            super();
        }
    
        //
        // method overrides for mere path observation
        //
        // ignore text
        @Override
        protected void processTextPosition(TextPosition text) { }
    
        // ignore bitmaps
        @Override
        public void drawImage(Image awtImage, AffineTransform at) { }
    
        // ignore shadings
        @Override
        public void shFill(COSName shadingName) throws IOException { }
    
        @Override
        public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
        {
            PDRectangle cropBox = aPage.findCropBox();
            this.pageSize = cropBox.createDimension();
            super.processStream(aPage, resources, cosStream);
        }
    
        @Override
        public void fillPath(int windingRule) throws IOException
        {
            printPath();
            System.out.printf("Fill; windingrule: %s\n\n", windingRule);
            getLinePath().reset();
        }
    
        @Override
        public void strokePath() throws IOException
        {
            printPath();
            System.out.printf("Stroke; unscaled width: %s\n\n", getGraphicsState().getLineWidth());
            getLinePath().reset();
        }
    
        void printPath()
        {
            GeneralPath path = getLinePath();
            PathIterator pathIterator = path.getPathIterator(null);
    
            double x = 0, y = 0;
            double coords[] = new double[6];
            while (!pathIterator.isDone()) {
                switch (pathIterator.currentSegment(coords)) {
                case PathIterator.SEG_MOVETO:
                    System.out.printf("Move to (%s %s)\n", coords[0], fixY(coords[1]));
                    x = coords[0];
                    y = coords[1];
                    break;
                case PathIterator.SEG_LINETO:
                    double width = getEffectiveWidth(coords[0] - x, coords[1] - y);
                    System.out.printf("Line to (%s %s), scaled width %s\n", coords[0], fixY(coords[1]), width);
                    x = coords[0];
                    y = coords[1];
                    break;
                case PathIterator.SEG_QUADTO:
                    System.out.printf("Quad along (%s %s) and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]));
                    x = coords[2];
                    y = coords[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    System.out.printf("Cubic along (%s %s), (%s %s), and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]), coords[4], fixY(coords[5]));
                    x = coords[4];
                    y = coords[5];
                    break;
                case PathIterator.SEG_CLOSE:
                    System.out.println("Close path");
                }
                pathIterator.next();
            }
        }
    
        double getEffectiveWidth(double dirX, double dirY)
        {
            if (dirX == 0 && dirY == 0)
                return 0;
            Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
            double widthX = dirY;
            double widthY = -dirX;
            double widthXTransformed = widthX * ctm.getValue(0, 0) + widthY * ctm.getValue(1, 0);
            double widthYTransformed = widthX * ctm.getValue(0, 1) + widthY * ctm.getValue(1, 1);
            double factor = Math.sqrt((widthXTransformed*widthXTransformed + widthYTransformed*widthYTransformed) / (widthX*widthX + widthY*widthY));
            return getGraphicsState().getLineWidth() * factor;
        }
    }
    

    (PrintPaths.java)

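    As we do not want to actually draw the page but merely extract the paths that would be drawn, we have to strip down the PageDrawer like this: text, bitmaps, and shadings are ignored, and the fill and stroke methods print the current path instead of painting it.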

    This sample parser prints the path drawing operations to show how to do it. Obviously, you can instead collect them for automated processing.
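    As a rough illustration of collecting instead of printing, here is a minimal sketch that walks a path the same way printPath() does and gathers its straight segments into a list. PathSegmentCollector is a hypothetical helper, not part of PDFBox; it uses only java.awt.geom and, as a simplifying assumption, skips curves because table borders are usually straight lines.

    ```java
    import java.awt.geom.GeneralPath;
    import java.awt.geom.Line2D;
    import java.awt.geom.PathIterator;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper (not a PDFBox class): collects the straight
    // segments of a path instead of printing them.
    final class PathSegmentCollector
    {
        public static List<Line2D.Double> collect(GeneralPath path)
        {
            List<Line2D.Double> segments = new ArrayList<Line2D.Double>();
            double[] coords = new double[6];
            double x = 0, y = 0;           // current point
            double startX = 0, startY = 0; // start of the current subpath

            for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next())
            {
                switch (it.currentSegment(coords))
                {
                case PathIterator.SEG_MOVETO:
                    x = startX = coords[0];
                    y = startY = coords[1];
                    break;
                case PathIterator.SEG_LINETO:
                    segments.add(new Line2D.Double(x, y, coords[0], coords[1]));
                    x = coords[0];
                    y = coords[1];
                    break;
                case PathIterator.SEG_QUADTO:
                    // Assumption: curves are not table borders; only track the end point.
                    x = coords[2];
                    y = coords[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    x = coords[4];
                    y = coords[5];
                    break;
                case PathIterator.SEG_CLOSE:
                    // Closing a subpath implicitly draws a line back to its start.
                    if (x != startX || y != startY)
                        segments.add(new Line2D.Double(x, y, startX, startY));
                    x = startX;
                    y = startY;
                    break;
                }
            }
            return segments;
        }
    }
    ```

    In the overrides above, one could presumably call PathSegmentCollector.collect(getLinePath()) in strokePath() and fillPath() before the path is reset, and store the result together with the line width. The collected coordinates are not flipped; apply fixY to them if you want the same orientation as in the printed output.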

    You can use the PrintPaths parser like this:

    PDDocument document = PDDocument.load(resource);
    List<?> allPages = document.getDocumentCatalog().getAllPages();
    int i = 7; // page 8
    
    System.out.println("\n\nPage " + (i+1));
    PrintPaths printPaths = new PrintPaths();
    
    PDPage page = (PDPage) allPages.get(i);
    PDStream contents = page.getContents();
    if (contents != null)
    {
        printPaths.processStream(page, page.findResources(), contents.getStream());
    }
    

    (ExtractPaths.java)

    The output is:

    Page 8
    Move to (35.92070007324219 724.6490478515625)
    Line to (574.72998046875 724.6490478515625), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    
    Move to (35.92070007324219 694.4660034179688)
    Line to (574.72998046875 694.4660034179688), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    
    Move to (292.2610168457031 468.677001953125)
    Line to (292.8590087890625 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (348.9360046386719 468.677001953125)
    Line to (349.53399658203125 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (405.6090087890625 468.677001953125)
    Line to (406.2070007324219 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (462.281982421875 468.677001953125)
    Line to (462.8799743652344 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (518.9549560546875 468.677001953125)
    Line to (519.553955078125 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (35.92070007324219 725.447998046875)
    Line to (574.72998046875 725.447998046875), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    
    Move to (35.92070007324219 212.5050048828125)
    Line to (574.72998046875 212.5050048828125), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    

    Quite peculiar: the vertical lines are actually drawn as very short (ca. 0.6 units), very thick (ca. 513 units) horizontal lines.
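    Because of this, the direction of a segment alone does not tell you whether it is a horizontal or a vertical rule. One way to deal with it is to convert each stroked segment into the rectangle it covers and classify that rectangle. The following is only a sketch: StrokedLine is a hypothetical helper, it handles only axis-parallel segments, it assumes butt line caps, and it assumes the scaled width is in the same units as the printed coordinates.

    ```java
    import java.awt.geom.Rectangle2D;

    // Hypothetical helper (not a PDFBox class): turns a stroked, axis-parallel
    // segment into the rectangle it covers on the page.
    final class StrokedLine
    {
        static Rectangle2D.Double coveredArea(double x1, double y1, double x2, double y2, double width)
        {
            double half = width / 2;
            if (y1 == y2) // horizontal segment: the stroke width extends along y
                return new Rectangle2D.Double(Math.min(x1, x2), y1 - half, Math.abs(x2 - x1), width);
            if (x1 == x2) // vertical segment: the stroke width extends along x
                return new Rectangle2D.Double(x1 - half, Math.min(y1, y2), width, Math.abs(y2 - y1));
            throw new IllegalArgumentException("only axis-parallel segments are supported");
        }

        // A rule counts as vertical if the covered area is taller than it is wide.
        static boolean looksVertical(Rectangle2D area)
        {
            return area.getHeight() > area.getWidth();
        }

        public static void main(String[] args)
        {
            // Third stroke of the output above: a short, very thick horizontal segment.
            Rectangle2D.Double area = coveredArea(292.2610168457031, 468.677001953125,
                    292.8590087890625, 468.677001953125, 512.9430076434463);
            System.out.println(area + " vertical: " + looksVertical(area));
        }
    }
    ```

    For the third stroke of the output above, the covered area is about 0.6 units wide and spans roughly y = 212 to y = 725, which fits between the horizontal rules printed at about 212.5 and 725.4.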