I am trying to replace text in a PDF with Apache PDFBox using the code below. The replacement itself works perfectly, but certain sections of the PDF go missing in the saved file. After calling the replaceTextInSecond function, I only save and close the document:
document.save("filename.pdf");
document.close();
Could you help me identify what causes this? Thanks in advance!
private static PDDocument replaceTextInSecond(PDDocument document, String searchString, String replacement) {
    PDPage page = document.getPage(1);
    PDFStreamParser parser;
    try {
        parser = new PDFStreamParser(page);
        parser.parse();
        List<?> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                String pstring = "";
                int prej = 0;
                if (op.getName().equals("Tj")) {
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    //System.out.println("string :::: " +string);
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            //System.out.println("string :::: " +string);
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }
                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());
                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
        //System.out.println("replaced " +searchString + " with " + replacement);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return document;
}
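I found the issue: the unconditional call
previous.setValue(string.getBytes());
in the Tj branch was the culprit. Guarding it so that a string is only rewritten when it actually contains the search text,
if (string.contains(searchString)) {
    string = string.replaceFirst(searchString, replacement);
    previous.setValue(string.getBytes());
}
resolved it. The cause appears to be related to how the PDF text is encoded: getString() decodes the raw string bytes and getBytes() re-encodes them with the platform default charset, and that round trip is not guaranteed to reproduce the original bytes (for example, when a font uses a custom or multi-byte encoding). Rewriting every Tj string therefore probably corrupted text that never needed to change.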
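For anyone curious why the round trip can lose data, here is a minimal sketch in plain Java with no PDFBox dependency. It rests on assumptions: ISO-8859-1 stands in for the decoding done by getString() (as far as I know PDFBox uses PDFDocEncoding or UTF-16 there), UTF-8 stands in for the platform default charset used by getBytes(), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {

    // Decode with one charset and re-encode with another: the same shape as
    // getString() followed by getBytes() in the original code.
    // ISO-8859-1 and UTF-8 are stand-ins (assumptions), not what PDFBox uses.
    public static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Guarded variant: a string that does not contain the search text is
    // returned untouched, so its original bytes survive.
    public static byte[] replaceIfPresent(byte[] raw, String search, String replacement) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        if (!decoded.contains(search)) {
            return raw; // nothing to replace, so do not re-encode
        }
        // String.replace is a literal (non-regex) replacement of all occurrences.
        return decoded.replace(search, replacement).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = {0x41, 0x42};                // "AB", plain ASCII
        byte[] glyphs = {(byte) 0xE9, (byte) 0x80}; // bytes above 0x7F

        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // true
        System.out.println(Arrays.equals(glyphs, roundTrip(glyphs))); // false: 2 bytes became 4
        System.out.println(Arrays.equals(glyphs, replaceIfPresent(glyphs, "AB", "CD"))); // true
    }
}
```

Pure ASCII text survives the round trip, which would explain why the replaced text itself looked fine while other parts of the page broke. The guard only avoids touching strings that do not match; a matching string that contains non-ASCII bytes could still be re-encoded incorrectly.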
Anyway, thanks for the help! :)