I'm using PDFBox to extract text from PDF documents by extending PDFTextStripper. Some of these documents contain invisible characters that are also being extracted, and I'd like to filter them out.
I see that there are already some Stack Overflow posts on this, for example:
I tried subclassing the PDFVisibleTextStripper class found here:
However, I found that this filtered out text that was in fact visible. I used it as a drop-in replacement for PDFTextStripper.
package com.example.foo;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

public class ExtractChars extends PDFVisibleTextStripper {
    Processor processor;

    public static void extract(PDDocument document, Processor processor) throws IOException {
        ExtractChars instance = new ExtractChars();
        instance.processor = processor;
        instance.setSortByPosition(true);
        // Page numbers in PDFTextStripper are 1-based.
        instance.setStartPage(1);
        instance.setEndPage(document.getNumberOfPages());
        // The written text is discarded; characters are reported via the processor.
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        Writer streamWriter = new OutputStreamWriter(stream);
        instance.writeText(document, streamWriter);
    }

    ExtractChars() throws IOException {}

    @Override
    protected void writeString(String _string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            float height = text.getHeightDir();
            String character = text.getUnicode();
            int pageIndex = getCurrentPageNo() - 1;
            // Direction-adjusted coordinates have their origin at the top left.
            float left = text.getXDirAdj();
            float right = left + text.getWidthDirAdj();
            float bottom = text.getYDirAdj();
            float top = bottom - height;
            BoundingBox box = new BoundingBox(pageIndex, left, right, top, bottom);
            this.processor.process(character, box);
        }
    }

    public interface Processor {
        void process(String character, BoundingBox box);
    }
}
I don't know if there's anything I need to change in my subclass to make this work correctly. I can provide a PDF that exhibits this behaviour if that would be helpful, although it contains sensitive content so I'd need to remove that first.
Instead, I have created a minimal example (below) that exhibits the 'invisible text' behaviour that I am seeing. The bulleted list contains an item at the end '24. a.' that can be highlighted in a PDF viewer such as macOS Preview and copy-pasted out.
This 'a.' is currently being extracted by PDFTextStripper and I'd like it not to be. I don't really understand why this is happening. My guess would be it's to do with clipping but I'd be really grateful if someone could explain what's going on.
My end goal is to filter these characters out, so if you have suggestions for how I could handle this specific case in the simplest possible way, that would be appreciated. I don't think I need all of the general methods in PDFVisibleTextStripper.
Many thanks!
%PDF-1.3
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 612 792]
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 4 0 R
/Contents 6 0 R
/MediaBox [0 0 612 792]
>>
endobj
4 0 obj
<<
/Font <<
/TT2 5 0 R
>>
>>
endobj
5 0 obj
<<
/BaseFont
/OXRDVC+Helvetica
/Subtype /TrueType
/Type /Font
>>
endobj
6 0 obj
<<
>>
stream
q 0 54 612 648 re W n /Cs1 cs 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm Q
q 48 93.30545 516 569.4218 re W n /Cs1 cs 1 1 1 sc 48 93.30545 516 569.4218 re f 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 66.86 589.28 Tm /TT2 1 Tf (24. ) Tj ET Q
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 96.86 40.39 Tm /TT2 1 Tf (a. ) Tj ET Q
endstream
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF
I figured out what's going on. The PDF contains a clipping rectangle that does not include 'a.'. I tried using PDFVisibleTextStripper but that stripped out text elsewhere in other documents that was in fact visible.
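To make the clipping concrete, here is a rough sketch that replays the numbers from the content stream in the minimal PDF using only java.awt.geom. The class and method names (ClipCheck, effectiveClip, toPage, originInsideClip) are made up for illustration, and it only tests each string's origin (the translation part of Tm), not its full glyph box, so treat it as a sanity check rather than a general visibility test.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: replays the clip and transform
// values from the minimal PDF's content stream.
final class ClipCheck {
    // Neither enclosing `q` is closed before the text is shown, so the two
    // `re W n` rectangles intersect to form the effective clip.
    static Rectangle2D effectiveClip() {
        Rectangle2D outer = new Rectangle2D.Double(0, 54, 612, 648);
        Rectangle2D inner = new Rectangle2D.Double(48, 93.30545, 516, 569.4218);
        return outer.createIntersection(inner);
    }

    // Applies `1 0 0 0.8181818 0 54 cm` to a text origin taken from `Tm`.
    static Point2D toPage(double tx, double ty) {
        AffineTransform cm = new AffineTransform(1, 0, 0, 0.8181818, 0, 54);
        return cm.transform(new Point2D.Double(tx, ty), null);
    }

    // True if the text origin falls inside the effective clip rectangle.
    static boolean originInsideClip(double tx, double ty) {
        return effectiveClip().contains(toPage(tx, ty));
    }

    public static void main(String[] args) {
        System.out.println("'24. ' origin " + toPage(66.86, 589.28)
                + " inside clip: " + originInsideClip(66.86, 589.28));
        System.out.println("'a. ' origin " + toPage(96.86, 40.39)
                + " inside clip: " + originInsideClip(96.86, 40.39));
    }
}
```

With these numbers the origin of '24. ' lands at about y = 536.14, inside the clip, while the origin of 'a. ' lands at about y = 87.05, below the clip's lower edge at y = 93.31.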
In the end, I wrote a class that inherits from PageDrawer and overrides the showGlyph method to access the characters being drawn on the page. This method checks if the bounding box of the character is outside getGraphicsState().getCurrentClippingPath().getBounds2D().
This unfortunately means I'm not using PDFTextStripper anymore, so I had to reimplement bits of its behaviour such as sorting characters by position (I was using setSortByPosition(true)). It was also a bit tricky to calculate the correct bounding box of the character based on font size and displacement.
ExtractChars.java
package com.example.foo;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import org.apache.pdfbox.util.Vector;

import java.awt.geom.*;
import java.io.*;

// This class effectively renders the PDF document in order to extract its
// text. It intercepts the showGlyph function provided by PageDrawer. We used to
// use PDFTextStripper but that has no way to exclude clipped characters.
public class ExtractChars extends PageDrawerHelper {
    // Skip erroneous characters smaller than this height. This might never happen
    // but there are places in the code that divide by height, so guard against it.
    static final float MIN_CHARACTER_HEIGHT = 0.01f;

    Processor processor;

    ExtractChars(PageDrawerParameters params, float pageHeight, int pageIndex, Processor processor) throws IOException {
        super(params, pageHeight, pageIndex);
        this.processor = processor;
    }

    // We can't move this method up to the superclass because the Renderer is
    // different each time. It needs to build an instance of the current class.
    public static void extract(PDDocument document, Processor processor) throws IOException {
        Renderer renderer = new Renderer(document);
        renderer.processor = processor;

        for (int i = 0; i < document.getNumberOfPages(); i += 1) {
            PDPage page = document.getPage(i);
            renderer.pageHeight = page.getMediaBox().getHeight();
            renderer.pageIndex = i;
            renderer.renderImage(i);
        }
    }

    @Override
    public void showGlyph(Matrix matrix, PDFont font, int _code, String unicode, Vector displacement) throws IOException {
        if (unicode == null) { return; }

        // Get the width and height of the character relative to font size.
        // The height does not change but the width does, e.g. 'M' is wider than 'I'.
        float width = displacement.getX();
        float height = fontHeight(font) / 2;

        BoundingBox charBox = clippedBoundingBox(matrix, width, height);

        // Skip the character if it is outside the clipping region and not visible.
        if (charBox == null) { return; }

        float boxHeight = charBox.bottom - charBox.top;
        if (boxHeight < MIN_CHARACTER_HEIGHT) { return; }

        // We need the text direction so we can sort text in separate buckets based on this.
        int direction = textDirection(matrix);

        processor.process(unicode, charBox, direction);
    }

    // https://stackoverflow.com/questions/17171815/get-the-font-height-of-a-character-in-pdfbox#answer-17202929
    float fontHeight(PDFont font) {
        return font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000;
    }

    int textDirection(Matrix matrix) {
        float a = matrix.getValue(0, 0);
        float b = matrix.getValue(0, 1);
        float c = matrix.getValue(1, 0);
        float d = matrix.getValue(1, 1);

        // This logic is copied from:
        // https://github.com/atsuoishimoto/pdfbox-ja/blob/master/src/main/java/org/apache/pdfbox/util/TextPosition.java
        if ((a > 0) && (Math.abs(b) < d) && (Math.abs(c) < a) && (d > 0)) {
            return 0;
        } else if ((a < 0) && (Math.abs(b) < Math.abs(d)) && (Math.abs(c) < Math.abs(a)) && (d < 0)) {
            return 180;
        } else if ((Math.abs(a) < Math.abs(c)) && (b > 0) && (c < 0) && (Math.abs(d) < b)) {
            return 90;
        } else if ((Math.abs(a) < c) && (b < 0) && (c > 0) && (Math.abs(d) < Math.abs(b))) {
            return 270;
        }

        return 0;
    }

    // We can't construct an instance of ExtractChars directly because its
    // constructor requires PageDrawerParameters which is private to the package.
    // Instead, make an instance via a renderer and forward the fields to it.
    static class Renderer extends PDFRenderer {
        Processor processor;
        float pageHeight;
        int pageIndex;

        Renderer(PDDocument document) {
            super(document);
        }

        @Override
        protected PageDrawer createPageDrawer(PageDrawerParameters params) throws IOException {
            return new ExtractChars(params, pageHeight, pageIndex, processor);
        }
    }

    public interface Processor {
        void process(String character, BoundingBox box, int direction);
    }
}
PageDrawerHelper.java
package com.example.foo;

import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;

import java.awt.geom.*;
import java.io.*;

// This class provides utility methods to subclasses, mostly so they can check
// if the current content is being clipped and therefore should be skipped.
//
// We shouldn't really use inheritance for sharing code but this has the
// advantage of being able to call some methods of the PageDrawer superclass.
public class PageDrawerHelper extends PageDrawer {
    float pageHeight;
    int pageIndex;

    PageDrawerHelper(PageDrawerParameters params, float pageHeight, int pageIndex) throws IOException {
        super(params);
        this.pageHeight = pageHeight;
        this.pageIndex = pageIndex;
    }

    // Gets the bounding box for a matrix by transforming corner points and taking
    // the min/max values in the x- and y-directions. This ensures rotation and skew
    // are taken into account. This method can return null if content is clipped.
    BoundingBox clippedBoundingBox(Matrix matrix, float width, float height) {
        Point2D p0 = matrix.transformPoint(0, 0);
        Point2D p1 = matrix.transformPoint(0, height);
        Point2D p2 = matrix.transformPoint(width, 0);
        Point2D p3 = matrix.transformPoint(width, height);

        BoundingBox contentBox = boundingBox(p0, p1, p2, p3);
        BoundingBox clippedBox = applyClipping(contentBox);

        return clippedBox;
    }

    // Converts from PDF coordinates (origin at the bottom left) to a box whose
    // origin is at the top left, by subtracting y-values from the page height.
    BoundingBox boundingBox(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        Point2D topLeft = topLeft(p0, p1, p2, p3);
        Point2D botRight = botRight(p0, p1, p2, p3);

        float left = (float)topLeft.getX();
        float right = (float)botRight.getX();
        float top = pageHeight - (float)botRight.getY();
        float bottom = pageHeight - (float)topLeft.getY();

        return new BoundingBox(pageIndex, left, right, top, bottom);
    }

    // Returns the minimum x and y over all points.
    Point2D topLeft(Point2D... points) {
        double minX = points[0].getX();
        double minY = points[0].getY();

        for (int i = 1; i < points.length; i += 1) {
            minX = Math.min(minX, points[i].getX());
            minY = Math.min(minY, points[i].getY());
        }

        return new Point2D.Double(minX, minY);
    }

    // Returns the maximum x and y over all points.
    Point2D botRight(Point2D... points) {
        double maxX = points[0].getX();
        double maxY = points[0].getY();

        for (int i = 1; i < points.length; i += 1) {
            maxX = Math.max(maxX, points[i].getX());
            maxY = Math.max(maxY, points[i].getY());
        }

        return new Point2D.Double(maxX, maxY);
    }

    // Intersects the box with the bounds of the current clipping path. Returns
    // null if the intersection is empty, i.e. the content is not visible.
    BoundingBox applyClipping(BoundingBox box) {
        Rectangle2D clip = getGraphicsState().getCurrentClippingPath().getBounds2D();

        float clipLeft = (float)clip.getMinX();
        float clipRight = (float)clip.getMaxX();
        float clipTop = pageHeight - (float)clip.getMaxY();
        float clipBottom = pageHeight - (float)clip.getMinY();

        float left = Math.max(box.left, clipLeft);
        float right = Math.min(box.right, clipRight);
        float top = Math.max(box.top, clipTop);
        float bottom = Math.min(box.bottom, clipBottom);

        if (left >= right || top >= bottom) {
            return null;
        } else {
            return new BoundingBox(pageIndex, left, right, top, bottom);
        }
    }
}
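The BoundingBox class used in the code above isn't shown in this post. A minimal sketch that would be consistent with how it is used here might look like the following. This is an assumption rather than the original class: the field types, the constructor argument order (pageIndex, left, right, top, bottom) and the top-left origin (so top is less than bottom) are inferred from the calls above, and the package declaration is left out.

```java
// Hypothetical sketch of the BoundingBox class referenced by the code above.
// Coordinates are assumed to use a top-left origin, so top < bottom.
class BoundingBox {
    final int pageIndex;
    final float left;
    final float right;
    final float top;
    final float bottom;

    BoundingBox(int pageIndex, float left, float right, float top, float bottom) {
        this.pageIndex = pageIndex;
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
    }

    @Override
    public String toString() {
        return "BoundingBox(page=" + pageIndex + ", left=" + left + ", right=" + right
                + ", top=" + top + ", bottom=" + bottom + ")";
    }
}
```

The fields are package-private so that classes in the same package can read box.left, box.top and so on directly, as the code above does.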
CharacterSorter.java
package com.example.foo;

import java.util.*;

public class CharacterSorter {
    ArrayList<String> characters;
    ArrayList<BoundingBox> boxes;
    ArrayList<Integer> directions;

    public CharacterSorter(ArrayList<String> characters, ArrayList<BoundingBox> boxes, ArrayList<Integer> directions) {
        this.characters = characters;
        this.boxes = boxes;
        this.directions = directions;
    }

    // Sorts the three parallel lists in place, keeping them aligned.
    public void sortByDirectionThenPosition() {
        ArrayList<Tuple> tuples = new ArrayList<>();

        for (int i = 0; i < characters.size(); i += 1) {
            tuples.add(new Tuple(characters.get(i), boxes.get(i), directions.get(i)));
        }

        Collections.sort(tuples);

        characters.clear();
        boxes.clear();
        directions.clear();

        for (Tuple tuple : tuples) {
            characters.add(tuple.character);
            boxes.add(tuple.box);
            directions.add(tuple.direction);
        }
    }

    // This helper class wraps the three fields associated with a single character
    // and provides a comparator function which mimics how PDFTextStripper orders
    // its characters when #setSortByPosition(true) is set.
    class Tuple implements Comparable<Tuple> {
        String character;
        BoundingBox box;
        Integer direction;

        Tuple(String character, BoundingBox box, Integer direction) {
            this.character = character;
            this.box = box;
            this.direction = direction;
        }

        @Override
        public int compareTo(Tuple other) {
            int primary = Integer.compare(box.pageIndex, other.box.pageIndex);
            if (primary != 0) { return primary; }

            // The remainder of this logic is copied and adapted from:
            // https://github.com/apache/pdfbox/blob/a78f4a2ea058181e5ed05d6367ba7556948331b8/pdfbox/src/main/java/org/apache/pdfbox/text/TextPositionComparator.java#L29-L70

            // Only compare text that is in the same direction.
            int secondary = Integer.compare(direction, other.direction);
            if (secondary != 0) { return secondary; }

            // Get the text direction adjusted coordinates.
            float x1 = box.left;
            float x2 = other.box.left;

            float pos1YBottom = box.bottom;
            float pos2YBottom = other.box.bottom;

            // Note that the coordinates have been adjusted so (0, 0) is in upper left.
            float pos1YTop = pos1YBottom - (box.bottom - box.top);
            float pos2YTop = pos2YBottom - (other.box.bottom - other.box.top);

            float yDifference = Math.abs(pos1YBottom - pos2YBottom);

            // We will do a simple tolerance comparison: characters whose vertical
            // ranges overlap are treated as being on the same line.
            if (yDifference < .1 ||
                pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
                pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
            {
                return Float.compare(x1, x2);
            } else if (pos1YBottom < pos2YBottom) {
                return -1;
            } else {
                return 1;
            }
        }
    }
}
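To show what the tolerance comparison does in practice, here is a small self-contained sketch of the same ordering rule applied to three characters. The Glyph class, the LineOrderDemo name and the coordinates are hypothetical stand-ins chosen for illustration; it leaves out the page and direction checks and does not depend on PDFBox or on the classes above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo of the "same line, then left to right" ordering rule.
final class LineOrderDemo {
    // Stand-in for a character and its box, using a top-left origin.
    static final class Glyph {
        final String text;
        final float left, top, bottom;

        Glyph(String text, float left, float top, float bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.bottom = bottom;
        }
    }

    // Glyphs whose vertical ranges overlap count as being on the same line and
    // are ordered by x; otherwise the one higher on the page comes first.
    static final Comparator<Glyph> BY_LINE_THEN_X = (g1, g2) -> {
        float yDifference = Math.abs(g1.bottom - g2.bottom);
        boolean sameLine = yDifference < 0.1f
                || (g2.bottom >= g1.top && g2.bottom <= g1.bottom)
                || (g1.bottom >= g2.top && g1.bottom <= g2.bottom);
        if (sameLine) {
            return Float.compare(g1.left, g2.left);
        }
        return g1.bottom < g2.bottom ? -1 : 1;
    };

    // Sorts a copy of the glyphs and concatenates their text.
    static String sortedText(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(BY_LINE_THEN_X);
        StringBuilder text = new StringBuilder();
        for (Glyph glyph : copy) {
            text.append(glyph.text);
        }
        return text.toString();
    }

    static String demo() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("c", 10f, 30f, 40f)); // second line
        glyphs.add(new Glyph("b", 20f, 10f, 20f)); // first line, further right
        glyphs.add(new Glyph("a", 10f, 11f, 21f)); // first line, one unit lower
        return sortedText(glyphs);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "abc"
    }
}
```

Here 'a' sits one unit lower than 'b', but their vertical ranges overlap, so they are treated as one line and ordered by their left edge.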