I've seen this question on many different forums, but I have yet to see it answered properly. A few of the answers may work for some people, but they are needlessly complicated. I worked out a solution myself, so see the answer below if you're interested.
The answer: you extract the color for each character via the processTextPosition() method in the PDFTextStripper class.
For the color to be available, you have to subclass PDFTextStripper and register additional operators in the subclass constructor, because the default PDFTextStripper does not process color operators and so never updates the color in its graphics state. See https://pdfbox.apache.org/2.0/migration.html under Text Extraction for more information. From that page, these are the operators to add in the subclass constructor:
addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());
We can then add a boolean to our new subclass that is set to true at the start of every page and every time a new line is started while the text is being processed:
import java.io.IOException;
import org.apache.pdfbox.contentstream.operator.color.*;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextStripperSuper extends PDFTextStripper {
    // true until the first chunk of the current line has been written
    boolean newLine = true;

    public PDFTextStripperSuper() throws IOException {
        // Register the color operators so the graphics state tracks the current colors.
        addOperator(new SetStrokingColorSpace());
        addOperator(new SetNonStrokingColorSpace());
        addOperator(new SetStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetStrokingDeviceGrayColor());
        addOperator(new SetStrokingColor());
        addOperator(new SetStrokingColorN());
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
    }

    @Override
    protected void startPage(PDPage page) throws IOException {
        newLine = true; // a new page always starts a new line
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException {
        newLine = true; // the next writeString() call begins a new line
        super.writeLineSeparator();
    }
}
So now we have a text stripper that knows where each line starts and keeps track of the current colors. To use it, all we have to do is override the writeString() method to collect each line of text, and override the processTextPosition() method to record the color of each character:
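import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.text.TextPosition;

public class DocAnalyzer {
    // Fields rather than locals, so the results are still reachable after the constructor returns.
    final List<String> lines = new ArrayList<>();
    final List<PDColor> charColors = new ArrayList<>();

    public DocAnalyzer(PDDocument doc) throws IOException {
        PDFTextStripperSuper tp = new PDFTextStripperSuper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions)
                    throws IOException {
                if (newLine) {
                    lines.add(text);
                    newLine = false;
                } else {
                    // writeString() can be called more than once per line
                    // (one call per chunk), so append later chunks to the current line.
                    int last = lines.size() - 1;
                    lines.set(last, lines.get(last) + getWordSeparator() + text);
                }
                super.writeString(text, textPositions);
            }

            @Override
            protected void processTextPosition(TextPosition text) {
                super.processTextPosition(text);
                // Text is normally filled, so the non-stroking color is the visible one.
                charColors.add(getGraphicsState().getNonStrokingColor());
            }
        };
        tp.getText(doc); // processes the text and fills both lists
    }
}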
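The entries in charColors are PDColor objects, which are awkward to print or compare directly. As far as I know, PDColor.toRGB() in PDFBox 2.0 returns the color as a packed 0xRRGGBB int (and can throw IOException), so a small helper like the one below can turn that int into something readable. This is only a sketch: the RgbHex class and its method names are made up for illustration, and it uses plain ints so it runs without PDFBox on the classpath.

```java
import java.util.Locale;

public class RgbHex {
    /** Formats a packed 0xRRGGBB value as "#RRGGBB"; bits above the low 24 are ignored. */
    public static String toHex(int packedRgb) {
        return String.format(Locale.ROOT, "#%06X", packedRgb & 0xFFFFFF);
    }

    /** Splits a packed 0xRRGGBB value into {red, green, blue}, each in the range 0..255. */
    public static int[] toComponents(int packedRgb) {
        return new int[] {
            (packedRgb >> 16) & 0xFF,
            (packedRgb >> 8) & 0xFF,
            packedRgb & 0xFF
        };
    }

    public static void main(String[] args) {
        // Stand-in for a value obtained from color.toRGB() (assumption, see above).
        int red = 0xFF0000;
        System.out.println(toHex(red)); // prints "#FF0000"

        int[] c = toComponents(0x336699);
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // prints "51 102 153"
    }
}
```

With PDFBox available, you would call something like RgbHex.toHex(color.toRGB()) for each entry in charColors and handle the IOException.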
There you have it! All the colors of the text should be in your charColors list. That's all the help I'm giving you ;)!
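If you care about which stretches of text share a color rather than about single characters, you can collapse the per-character colors into runs. The sketch below assumes you already have a string and exactly one packed RGB int per Java char of it. The code above does not guarantee that alignment (processTextPosition() is called per glyph, and one glyph may map to several chars or be dropped later), so you would have to line the two up yourself. ColorRuns, Run and group() are hypothetical names, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColorRuns {
    /** A stretch of consecutive characters that share one packed RGB color. */
    public static final class Run {
        public final String text;
        public final int rgb;

        Run(String text, int rgb) {
            this.text = text;
            this.rgb = rgb;
        }

        @Override
        public String toString() {
            return String.format("\"%s\" #%06X", text, rgb & 0xFFFFFF);
        }
    }

    /** Groups text into runs of equal color; expects one color per char of text. */
    public static List<Run> group(String text, List<Integer> rgbPerChar) {
        if (text.length() != rgbPerChar.size()) {
            throw new IllegalArgumentException("need exactly one color per character");
        }
        List<Run> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean atEnd = (i == text.length());
            // Close the current run at the end of the text or when the color changes.
            if (atEnd || !rgbPerChar.get(i).equals(rgbPerChar.get(start))) {
                runs.add(new Run(text.substring(start, i), rgbPerChar.get(start)));
                start = i;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Integer> colors = Arrays.asList(0xFF0000, 0xFF0000, 0, 0, 0);
        for (Run run : group("Hello", colors)) {
            System.out.println(run);
        }
    }
}
```

For the sample input, the first two characters form a red run and the remaining three a black one.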