I am trying to find table border lines in a PDF. I used the PrintTextLocations class of PDFBox to build the words. Now I want to find the coordinates of the lines that form the table. I tried using org.apache.pdfbox.pdfviewer.PageDrawer, but I am unable to find any character/graphic containing those lines. I tried two ways:
First:
Graphics g = null;
Dimension d = new Dimension();
d.setSize(700, 700);
PageDrawer pageDrawer = new PageDrawer();
pageDrawer.drawPage(g, myPage, d);
That gave me a NullPointerException. Second, I tried to override the processStream function, but I am unable to get any stroke. Kindly help me out. I am open to using any other library that gives me the coordinates of the lines in the table. And another quick question: what kind of objects are those table border lines in PDFBox? Are they graphics or characters?
Here is the link to the sample PDF I am trying to parse: http://stats.bls.gov/news.release/pdf/empsit.pdf (I am trying to get the table lines on page 8).
Edit: I faced another problem. While parsing page 1 of this PDF, I am unable to get any lines because the pathIterator in the printPath() function is empty, although the strokePath() function is called for each line. How can I work with this PDF?
In the 1.8.* versions, PDFBox's parsing capabilities were implemented in a not very generic way. In particular, the OperatorProcessor implementations were tightly coupled to specific parser classes; e.g. the implementations dealing with path drawing operations assumed they were interacting with a PageDrawer instance.
Thus, unless one wanted to copy & paste all those OperatorProcessor classes with minute changes, one had to derive from such a specific parser class.
In your case, therefore, we will also derive our parser from PageDrawer; after all, we are interested in path drawing operations:
// imports for PDFBox 1.8.x
import java.awt.Image;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdfviewer.PageDrawer;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.TextPosition;

public class PrintPaths extends PageDrawer
{
    //
    // constructor
    //
    public PrintPaths() throws IOException
    {
        super();
    }

    //
    // method overrides for mere path observation
    //
    // ignore text
    @Override
    protected void processTextPosition(TextPosition text) { }

    // ignore bitmaps
    @Override
    public void drawImage(Image awtImage, AffineTransform at) { }

    // ignore shadings
    @Override
    public void shFill(COSName shadingName) throws IOException { }

    // initialize pageSize, which drawPage() would otherwise set
    @Override
    public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
    {
        PDRectangle cropBox = aPage.findCropBox();
        this.pageSize = cropBox.createDimension();
        super.processStream(aPage, resources, cosStream);
    }

    // print the current path instead of filling it, then discard it
    @Override
    public void fillPath(int windingRule) throws IOException
    {
        printPath();
        System.out.printf("Fill; windingrule: %s\n\n", windingRule);
        getLinePath().reset();
    }

    // print the current path instead of stroking it, then discard it
    @Override
    public void strokePath() throws IOException
    {
        printPath();
        System.out.printf("Stroke; unscaled width: %s\n\n", getGraphicsState().getLineWidth());
        getLinePath().reset();
    }

    // print every segment of the current path
    void printPath()
    {
        GeneralPath path = getLinePath();
        PathIterator pathIterator = path.getPathIterator(null);
        double x = 0, y = 0;
        double coords[] = new double[6];
        while (!pathIterator.isDone()) {
            switch (pathIterator.currentSegment(coords)) {
            case PathIterator.SEG_MOVETO:
                System.out.printf("Move to (%s %s)\n", coords[0], fixY(coords[1]));
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_LINETO:
                double width = getEffectiveWidth(coords[0] - x, coords[1] - y);
                System.out.printf("Line to (%s %s), scaled width %s\n", coords[0], fixY(coords[1]), width);
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_QUADTO:
                System.out.printf("Quad along (%s %s) and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]));
                x = coords[2];
                y = coords[3];
                break;
            case PathIterator.SEG_CUBICTO:
                System.out.printf("Cubic along (%s %s), (%s %s), and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]), coords[4], fixY(coords[5]));
                x = coords[4];
                y = coords[5];
                break;
            case PathIterator.SEG_CLOSE:
                System.out.println("Close path");
                break;
            }
            pathIterator.next();
        }
    }

    // line width scaled by how the CTM stretches the direction perpendicular to the segment
    double getEffectiveWidth(double dirX, double dirY)
    {
        if (dirX == 0 && dirY == 0)
            return 0;
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        double widthX = dirY;
        double widthY = -dirX;
        double widthXTransformed = widthX * ctm.getValue(0, 0) + widthY * ctm.getValue(1, 0);
        double widthYTransformed = widthX * ctm.getValue(0, 1) + widthY * ctm.getValue(1, 1);
        double factor = Math.sqrt((widthXTransformed*widthXTransformed + widthYTransformed*widthYTransformed) / (widthX*widthX + widthY*widthY));
        return getGraphicsState().getLineWidth() * factor;
    }
}
As we do not want to actually draw the page but merely extract the paths that would be drawn, we have to strip down the PageDrawer like this.
This sample parser merely prints the path drawing operations to show how to do it. Obviously, you can instead collect them for automated processing.
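As one way of collecting instead of printing, here is a minimal sketch that walks a java.awt.geom path the same way printPath() does and gathers the straight segments into a list. The class PathSegmentCollector and its collectLines method are hypothetical names, not part of PDFBox; the sketch skips curves and does not flip the y coordinate.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): gathers the straight segments of a path.
public class PathSegmentCollector {

    public static List<Line2D.Double> collectLines(GeneralPath path) {
        List<Line2D.Double> lines = new ArrayList<Line2D.Double>();
        double[] coords = new double[6];
        double x = 0, y = 0;           // current point
        double startX = 0, startY = 0; // start of the current subpath
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(coords)) {
            case PathIterator.SEG_MOVETO:
                x = startX = coords[0];
                y = startY = coords[1];
                break;
            case PathIterator.SEG_LINETO:
                lines.add(new Line2D.Double(x, y, coords[0], coords[1]));
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_QUADTO:
                // curves are skipped, only the current point is advanced
                x = coords[2];
                y = coords[3];
                break;
            case PathIterator.SEG_CUBICTO:
                x = coords[4];
                y = coords[5];
                break;
            case PathIterator.SEG_CLOSE:
                // closing a subpath adds a line back to its start, unless already there
                if (x != startX || y != startY) {
                    lines.add(new Line2D.Double(x, y, startX, startY));
                }
                x = startX;
                y = startY;
                break;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        GeneralPath path = new GeneralPath();
        path.moveTo(35.9207, 724.649);
        path.lineTo(574.73, 724.649);
        for (Line2D.Double line : collectLines(path)) {
            System.out.println(line.getP1() + " -> " + line.getP2());
        }
    }
}
```

In a PageDrawer subclass like the one above, such a method could be called with getLinePath() from strokePath() and fillPath() before the path is reset, with the line width stored alongside each segment.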
You can use the parser like this:
PDDocument document = PDDocument.load(resource);
List<?> allPages = document.getDocumentCatalog().getAllPages();
int i = 7; // page 8
System.out.println("\n\nPage " + (i+1));
PrintPaths printPaths = new PrintPaths();
PDPage page = (PDPage) allPages.get(i);
PDStream contents = page.getContents();
if (contents != null)
{
    printPaths.processStream(page, page.findResources(), contents.getStream());
}
The output is:
Page 8
Move to (35.92070007324219 724.6490478515625)
Line to (574.72998046875 724.6490478515625), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (35.92070007324219 694.4660034179688)
Line to (574.72998046875 694.4660034179688), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (292.2610168457031 468.677001953125)
Line to (292.8590087890625 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (348.9360046386719 468.677001953125)
Line to (349.53399658203125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (405.6090087890625 468.677001953125)
Line to (406.2070007324219 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (462.281982421875 468.677001953125)
Line to (462.8799743652344 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (518.9549560546875 468.677001953125)
Line to (519.553955078125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (35.92070007324219 725.447998046875)
Line to (574.72998046875 725.447998046875), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (35.92070007324219 212.5050048828125)
Line to (574.72998046875 212.5050048828125), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Quite peculiar: the vertical lines are actually drawn as very short (ca. 0.6 units), very thick (ca. 513 units) horizontal lines.
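Because of this, looking at the segment direction alone would misclassify these table lines. A minimal sketch of one way around it, assuming axis-aligned segments, butt line caps, and a width already scaled to page units: expand each stroked segment by half its width perpendicular to its direction and classify the resulting rectangle by its longer side. The class StrokeToRule and its methods are hypothetical names used for illustration only.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: turns a stroked axis-aligned segment into the rectangle it paints.
public class StrokeToRule {

    // Assumes butt caps, i.e. the stroke does not extend beyond the segment end points.
    public static Rectangle2D.Double strokedArea(double x1, double y1, double x2, double y2, double width) {
        double half = width / 2;
        if (y1 == y2) { // horizontal segment: the line width extends vertically
            return new Rectangle2D.Double(Math.min(x1, x2), y1 - half, Math.abs(x2 - x1), width);
        }
        if (x1 == x2) { // vertical segment: the line width extends horizontally
            return new Rectangle2D.Double(x1 - half, Math.min(y1, y2), width, Math.abs(y2 - y1));
        }
        throw new IllegalArgumentException("segment is not axis-aligned");
    }

    // A painted area that is taller than wide acts as a vertical rule.
    public static boolean looksVertical(Rectangle2D area) {
        return area.getHeight() > area.getWidth();
    }

    public static void main(String[] args) {
        // first short, thick segment from the page 8 output
        Rectangle2D.Double area = strokedArea(292.2610168457031, 468.677001953125,
                292.8590087890625, 468.677001953125, 512.9430076434463);
        System.out.printf("x %.2f..%.2f, y %.2f..%.2f, vertical: %b%n",
                area.getMinX(), area.getMaxX(), area.getMinY(), area.getMaxY(), looksVertical(area));
    }
}
```

For that first short segment, the rectangle is about 0.6 units wide and spans roughly y = 212.2 to y = 725.1, which fits between the horizontal lines at about 212.5 and 725.4 in the output.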