Edit the text font in a PDF document

I recently discovered the open-source Python library borb and was impressed by its capabilities! I am trying to use it to change the font in a PDF document, but so far I have been unable to. Is this possible?

I have a document that contains certain words which are in Wingdings font. I first need to programmatically find those strings of text that are in Wingdings, and then edit them so they are in Arial.

Is that something that can be accomplished using borb?

Solution

Disclaimer: I am the author of borb.

Let's break the problem into its individual building blocks.

Finding Wingdings

First we need to find everything that's written in Wingdings. For this we can use something similar to the example in section 5.8.1 of the example repository.

For the sake of completeness, I'll post the code here again:

#!chapter_005/src/snippet_017.py
import typing

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import FontNameFilter
from borb.toolkit import SimpleTextExtraction


def main():

    # create FontNameFilter
    l0: FontNameFilter = FontNameFilter("Courier")

    # filtered text just gets passed to SimpleTextExtraction
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)

    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l0])

    # check whether we have read a Document
    assert doc is not None

    # print the text on page 0 that is written in the filtered font
    print(l1.get_text()[0])


if __name__ == "__main__":
    main()

If you were to go looking into FontNameFilter you would see something like this:

    def _event_occurred(self, event: "Event") -> None:
        # filter ChunkOfTextRenderEvent
        if isinstance(event, ChunkOfTextRenderEvent):
            font_name: typing.Optional[str] = event.get_font().get_font_name()
            if font_name == self._font_name:
                for l in self._listeners:
                    l._event_occurred(event)
            return
        # default
        for l in self._listeners:
            l._event_occurred(event)

This code accomplishes the following:

It treats a ChunkOfTextRenderEvent as "a chunk of text being rendered".
Whenever a ChunkOfTextRenderEvent has the specified font, it passes the Event along to all of its listeners (implementations of EventListener). Chunks of text in any other font are dropped.
Whenever any other kind of Event comes along (an ImageRenderEvent, for instance), it passes that Event along unchanged.

You could easily write your own implementation of EventListener (based on the example I just provided) that looks out for Wingdings, and keeps track of its coordinates.
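As a starting point for such a listener, here is a minimal sketch of the matching and bookkeeping logic only. It deliberately does not import borb, and the names is_wingdings, TextHit and WingdingsTracker are hypothetical helpers, not part of the library. One assumption worth checking against your own file: embedded subset fonts often carry a tag of six uppercase letters and a plus sign in front of the name (for example "ABCDEF+Wingdings-Regular"), so an exact comparison like the `font_name == self._font_name` shown above may not match.

```python
import re
import typing
from dataclasses import dataclass, field

# Subset fonts are commonly named "<six uppercase letters>+<base name>".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_wingdings(font_name: typing.Optional[str]) -> bool:
    """Return True if a (possibly subset-tagged) font name looks like Wingdings."""
    if not font_name:
        return False
    base_name = _SUBSET_TAG.sub("", font_name)
    # NOTE: this also matches "Wingdings 2" and "Wingdings 3",
    # which are different fonts; tighten the check if that matters.
    return "wingdings" in base_name.lower()


@dataclass
class TextHit:
    """One chunk of Wingdings text and where it was found (hypothetical record)."""
    page_nr: int
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class WingdingsTracker:
    """Collects every chunk of text whose font looks like Wingdings."""
    hits: typing.List[TextHit] = field(default_factory=list)

    def record(
        self,
        page_nr: int,
        font_name: typing.Optional[str],
        text: str,
        x: float,
        y: float,
        width: float,
        height: float,
    ) -> bool:
        """Store the chunk if it is Wingdings; return whether it was stored."""
        if not is_wingdings(font_name):
            return False
        self.hits.append(TextHit(page_nr, text, x, y, width, height))
        return True
```

Inside your own `_event_occurred`, you would call `record(...)` for each ChunkOfTextRenderEvent, passing `event.get_font().get_font_name()` together with the chunk's text and bounding box, using whichever accessors your borb version exposes for those.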

Determining what was written

Fonts are complicated. The way the rendering instructions map to a given character/glyph inside a Font may differ from one PDF to another. You will probably need some custom logic to decide how to "decode" the Wingdings characters into legible text.
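To give an idea of what that custom logic might look like, here is a rough sketch built around a lookup table. Everything in it is an assumption you should verify against your own document: the table is deliberately tiny, the replacement strings are arbitrary choices, and the names WINGDINGS_TO_TEXT, normalize_symbol_char and decode_wingdings are made up for this example. It also assumes that the extracted characters are either the plain character codes (such as "J" for the smiling face) or the same codes shifted into the Private Use Area (U+F020 to U+F0FF), which is a common convention for symbol fonts but not guaranteed for every PDF.

```python
import typing

# Illustrative, incomplete table: Wingdings character code -> legible replacement.
# The replacement strings are placeholders; pick whatever suits your document.
WINGDINGS_TO_TEXT: typing.Dict[str, str] = {
    "J": ":-)",       # smiling face
    "L": ":-(",       # frowning face
    "\u00FC": "[x]",  # check mark (the code point of "ü")
}


def normalize_symbol_char(ch: str) -> str:
    """Map a Private Use Area symbol code (U+F020..U+F0FF) back to its base code."""
    code = ord(ch)
    if 0xF020 <= code <= 0xF0FF:
        return chr(code - 0xF000)
    return ch


def decode_wingdings(
    raw: str,
    table: typing.Dict[str, str] = WINGDINGS_TO_TEXT,
    unknown: str = "?",
) -> str:
    """Translate extracted Wingdings text, marking unmapped characters as unknown."""
    return "".join(table.get(normalize_symbol_char(ch), unknown) for ch in raw)
```

Using an explicit marker for unknown characters makes it easy to spot which entries are still missing from the table after a first run over the document.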

Removing content

This is again "known terrain". You can simply add RedactAnnotation objects to the PDF. An Annotation can be thought of as "content that was added after creating the PDF".

A RedactAnnotation represents "someone scribbling black marker over the document to erase part of its content".

After adding a RedactAnnotation, you can "apply" it, which effectively erases the content underneath.

Adding content to a PDF at a precise location

Once you've identified the locations of the Wingdings characters, decided what you'd like to replace them with, and removed them, you can add the alternative content.

In order to do so, you can borrow inspiration from the examples repository.

#!chapter_002/src/snippet_018.py
from decimal import Decimal

from borb.pdf import Document
from borb.pdf import PDF
from borb.pdf import Page
from borb.pdf import Paragraph
from borb.pdf.canvas.geometry.rectangle import Rectangle


def main():
    # create Document
    doc: Document = Document()

    # create Page
    page: Page = Page()

    # add Page to Document
    doc.add_page(page)

    # define layout rectangle
    # fmt: off
    r: Rectangle = Rectangle(
        Decimal(59),                # x: 0 + page_margin
        Decimal(848 - 84 - 100),    # y: page_height - page_margin - height_of_textbox
        Decimal(595 - 59 * 2),      # width: page_width - 2 * page_margin
        Decimal(100),               # height
    )
    # fmt: on

    # the next line of code uses absolute positioning
    Paragraph("Hello World!").paint(page, r)

    # store
    with open("output.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()

The crucial part here is:

Paragraph("Hello World!").paint(page, r)

This paints a Paragraph object inside the specified Rectangle on the given Page.

Now it's a matter of putting it all together.