How to remove specific pattern text from PDF using PDFBox?

I have a PDF with placeholder tags such as {{place_holder}}. How can I remove all occurrences of such tags from the document using the PDFBox library?

Sample PDF: https://github.com/nofelkad/pdf-sample/blob/main/sample_tag.pdf

I tried this example, but it did not work for me.

@Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
    String recentText = recentChars.toString();
    recentChars.setLength(0);


    String operatorString = operator.getName();

    if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "{{full_name}}".equals(recentText))
    {
        return;
    }

    super.write(contentStreamWriter, operator, operands);
}

--- Update 1 ---

With the code below, it works as expected for the file shared above. But it does not work with a similar file generated by exporting from Microsoft Word as PDF. Here is the file which did not work.

In my case pdf can be generated on any system. So I am looking for more of a generic solution.

PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
    @Override
    protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
        String operatorString = operator.getName();
        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
        {
            if(operands.get(0) instanceof COSString ){
                COSString str= (COSString) operands.get(0);
                String text=str.getString();
                String updated= extractStringsBetweenCurlyBraces(text);
                if(!text.equals(updated)){
                    str.setValue(updated.getBytes());
                }
            }
            if(operands.get(0) instanceof COSArray ){
                Iterator var7 =  ((COSArray) operands.get(0)).iterator();
                while(var7.hasNext()) {
                    COSBase obj = (COSBase) var7.next();
                    if (obj instanceof COSString) {
                        COSString str= (COSString) obj;
                        String text=str.getString();
                        String updated= extractStringsBetweenCurlyBraces(text);
                        str.setValue(updated.getBytes());
                    }
                }
            }
        }
        super.write(contentStreamWriter, operator, operands);
    }
    final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
editor.processPage(page);



public static String extractStringsBetweenCurlyBraces(String input) {
    Pattern pattern = Pattern.compile("\\{\\{[^}]*\\}\\}|\\{\\{.*$");
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        String match = matcher.group();
        String replacement = " ".repeat(match.length()+7);
        input= input.replace(match,replacement);
    }

    pattern = Pattern.compile("^.*?\\}\\}");
    matcher = pattern.matcher(input);
    while (matcher.find()) {
        String match = matcher.group();
        String replacement = " ".repeat(match.length()+7);
        input= input.replace(match,replacement);
    }
    return input;
}

Solution

The main issue in your code (the one you provided in Update 1) is that you assume some constant, standard encoding of the COSString arguments of the text showing operators. This is not the case. According to the specification, these arguments shall be interpreted as sequences of character codes according to the encoding of the current PDF font object.

Thus, you should replace the PdfContentStreamEditor editor in your code by something like this:

PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
    @Override
    protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
        String operatorString = operator.getName();
        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
        {
            PDFont font = getGraphicsState().getTextState().getFont();
            if (operands.get(0) instanceof COSString) {
                COSString str = (COSString) operands.get(0);
                String text = decode(str, font);// str.getString();
                String updated = extractStringsBetweenCurlyBraces(text);
                if(!text.equals(updated)){
                    str.setValue(font.encode(updated));//(updated.getBytes());
                }
            }
            if (operands.get(0) instanceof COSArray) {
                Iterator<?> var7 = ((COSArray) operands.get(0)).iterator();
                while (var7.hasNext()) {
                    COSBase obj = (COSBase) var7.next();
                    if (obj instanceof COSString) {
                        COSString str = (COSString) obj;
                        String text = decode(str, font);//str.getString();
                        String updated = extractStringsBetweenCurlyBraces(text);
                        str.setValue(font.encode(updated));//(updated.getBytes());
                    }
                }
            }
        }
        super.write(contentStreamWriter, operator, operands);
    }

    String decode(COSString string, PDFont font) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(string.getBytes())) {
            while (in.available() > 0) {
                int code = font.readCode(in);
                String chars = font.toUnicode(code);
                builder.append(chars);
            }
        }
        return builder.toString();
    }

    final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};

(RemoveText method removeTagsLikeVasK)

As you see, instead of updated.getBytes() now font.encode(updated) is used, ensuring that the COS string contains the bytes of updated encoded according to the encoding of the font.

Similarly, str.getString() is replaced by decode(str, font) which decodes the bytes of the COS string according to the font encoding.

With this change in place, the output is not distorted anymore.
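To illustrate why decoding through the font matters, here is a toy model that does not use PDFBox at all. The class ToyFontEncoding and its code table are purely hypothetical; they stand in for a subset font that assigns its own single-byte codes to glyphs. Real fonts (in particular the multi-byte encodings Word often emits) are more involved, so treat this only as a sketch of the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ToyFontEncoding {
    // Hypothetical code-to-character table, as a subset font might define it.
    static final Map<Integer, Character> CODE_TO_CHAR = new HashMap<>();
    static {
        CODE_TO_CHAR.put(1, '{');
        CODE_TO_CHAR.put(2, 'a');
        CODE_TO_CHAR.put(3, '}');
    }

    // Interprets each byte as a character code of the toy font.
    static String decodeWithFont(byte[] codes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            Character c = CODE_TO_CHAR.get(b & 0xFF);
            sb.append(c != null ? c : '\uFFFD'); // unknown code: replacement character
        }
        return sb.toString();
    }

    // Interprets the same bytes with a fixed charset, ignoring the font.
    static String decodeNaively(byte[] codes) {
        return new String(codes, StandardCharsets.ISO_8859_1);
    }
}
```

For the bytes 1, 1, 2, 3, 3, decodeWithFont yields {{a}}, while decodeNaively yields control characters in which no {{ can ever be found. Under this assumption, a search in the naively decoded text silently misses the tag.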

Unfortunately, though, the output is not yet as you want it. The reason is that your extractStringsBetweenCurlyBraces method only recognizes that it has to remove something from a given string if it contains a {{ or a }}. But that approach ignores the possibility that a previous string already contained the {{ and a later string will contain the }}.

In your example file created by MS Word, that exact case occurs. It contains, for example, an instruction like this (with character codes replaced by regular text):

[(I {{fu) -3 (l) 3 (l) 3 (_n) -3 (ame) -3 (}} e) 5 (mplo) 8 (y) 13 (ee) -3 ( of {{ c) 8 (ompa) -2 (n) 18 (y_n) -3 (am) 5 (e}},agr) 15 (ee) -3 ( tha) 12 (t I wi) 4 (l) 3 (l) 3 (n) ] TJ

As you see, {{full_name}} for example is split over 6 strings, I {{fu, l, l, _n, ame, and }} e, and your approach will only handle the first and the last of them!

This means that you will have to update your mechanism to keep the state (inside or outside a tag) at the end of one string as starting state of the next one.

Furthermore, even the {{ and }} tag delimiters may themselves be split across separate strings, which adds even more cases to consider.
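One possible way to cope with split tags and split delimiters is sketched below. It is not part of PDFBox, and the names SplitTagRemover and removeTags are made up for illustration. The sketch assumes you have already decoded a group of strings (for example all strings of one TJ array) to Unicode. It joins them, finds the tags in the joined text, and then drops the matching characters from each original string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitTagRemover {
    // A tag is "{{", any characters except braces, then "}}".
    private static final Pattern TAG = Pattern.compile("\\{\\{[^{}]*\\}\\}");

    // Returns the fragments in the same order, with all tag characters removed.
    static List<String> removeTags(List<String> fragments) {
        // Join the fragments so tags spanning several strings become visible.
        StringBuilder joined = new StringBuilder();
        for (String fragment : fragments) {
            joined.append(fragment);
        }

        // Mark every character position that belongs to a tag.
        boolean[] drop = new boolean[joined.length()];
        Matcher matcher = TAG.matcher(joined);
        while (matcher.find()) {
            for (int i = matcher.start(); i < matcher.end(); i++) {
                drop[i] = true;
            }
        }

        // Rebuild each fragment from its unmarked characters.
        List<String> result = new ArrayList<>();
        int offset = 0;
        for (String fragment : fragments) {
            StringBuilder kept = new StringBuilder();
            for (int i = 0; i < fragment.length(); i++) {
                if (!drop[offset + i]) {
                    kept.append(fragment.charAt(i));
                }
            }
            offset += fragment.length();
            result.add(kept.toString());
        }
        return result;
    }
}
```

Applied to the first six strings of the TJ instruction above, this leaves "I " and " e" and empties the four strings in between. Each result would still have to be re-encoded with font.encode as before. Note the limits: a tag that spans more than one text showing instruction is not caught unless you collect the strings of all of them first, and removing characters (rather than replacing them with spaces) shifts the text that follows.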

Alternatively you can try the approach of this question & answer. While in your current approach you essentially create a near identical copy of the original content stream and manipulate it only a little bit, the approach there gathers the text information from the original and drops the original content stream, building a completely new content stream from the gathered information except the characters to drop.

You write:

In my case pdf can be generated on any system. So I am looking for more of a generic solution.

You can make your solution more generic as outlined above. But there are limits. In particular, there are PDF creators whose outputs contain font information that does not allow the round trips applied here (decoding the character codes to Unicode and later re-encoding the manipulated Unicode to character codes), or that do not draw the text as text at all (but as arbitrary vector graphics), or that make your life difficult in other ways...

I used PDFBox 2.0.28 for the code in this answer.