
Edit the text font in a PDF document


I recently discovered the open-source Python library borb and was impressed by its capabilities! I am trying to use it to change the font in a PDF document, but so far without success. Is this possible?

I have a document that contains certain words which are in Wingdings font. I first need to programmatically find those strings of text that are in Wingdings, and then edit them so they are in Arial.

Is that something that can be accomplished using borb?


Solution

  • Disclaimer: I am the author of borb.

    Let's break the problem into its individual building blocks.

    Finding Wingdings

    First we need to find everything that's written in Wingdings. For this we can use something similar to the example in section 5.8.1 of the example repository.

    For the sake of completeness, I'll post the code here again:

    #!chapter_005/src/snippet_017.py
    import typing
    
    from borb.pdf import Document
    from borb.pdf import PDF
    from borb.toolkit import FontNameFilter
    from borb.toolkit import SimpleTextExtraction
    
    
    def main():
    
        # create FontNameFilter
        l0: FontNameFilter = FontNameFilter("Courier")
    
        # filtered text just gets passed to SimpleTextExtraction
        l1: SimpleTextExtraction = SimpleTextExtraction()
        l0.add_listener(l1)
    
        # read the Document
        doc: typing.Optional[Document] = None
        with open("output.pdf", "rb") as in_file_handle:
            doc = PDF.loads(in_file_handle, [l0])
    
        # check whether we have read a Document
        assert doc is not None
    
        # print the text on the first page that is written in the filtered font
        print(l1.get_text()[0])
    
    
    if __name__ == "__main__":
        main()
    

    If you were to go looking into FontNameFilter you would see something like this:

        def _event_occurred(self, event: "Event") -> None:
            # filter ChunkOfTextRenderEvent
            if isinstance(event, ChunkOfTextRenderEvent):
                font_name: typing.Optional[str] = event.get_font().get_font_name()
                if font_name == self._font_name:
                    for l in self._listeners:
                        l._event_occurred(event)
                return
            # default
            for l in self._listeners:
                l._event_occurred(event)
    

    This code does the following:

    • A ChunkOfTextRenderEvent represents "a chunk of text being rendered".
    • Whenever a ChunkOfTextRenderEvent uses the specified font, the Event is passed along to all registered listeners (implementations of EventListener). Text in any other font is dropped.
    • Whenever any other kind of Event comes along (an ImageRenderEvent, for instance), it is passed along unchanged.

    You could easily write your own implementation of EventListener (based on the example I just provided) that looks out for Wingdings, and keeps track of its coordinates.
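    As a rough illustration of such a listener, here is a minimal sketch that does not import borb at all. FakeFont and FakeTextEvent are hypothetical stand-ins for borb's font and ChunkOfTextRenderEvent objects, and the `text` and `box` attributes are assumptions made for this sketch; how you read the text and bounding box from a real event depends on your borb version. Only `get_font().get_font_name()` is taken from the snippet above. The substring match is deliberate: embedded fonts often carry a subset prefix, so the name may look like "ABCDEF+Wingdings-Regular" rather than plain "Wingdings".

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


class FakeFont:
    """Hypothetical stand-in for a borb font object (sketch only)."""

    def __init__(self, name: str) -> None:
        self._name = name

    def get_font_name(self) -> str:
        return self._name


class FakeTextEvent:
    """Hypothetical stand-in for ChunkOfTextRenderEvent (sketch only)."""

    def __init__(self, text: str, font_name: str, box: Box) -> None:
        self.text = text  # assumed attribute, not borb's API
        self.box = box    # assumed attribute, not borb's API
        self._font = FakeFont(font_name)

    def get_font(self) -> FakeFont:
        return self._font


class FontTracker:
    """Records (text, box) for every text event whose font name matches."""

    def __init__(self, font_name: str) -> None:
        self._needle = font_name.lower()
        self.hits: List[Tuple[str, Box]] = []

    def _event_occurred(self, event) -> None:
        # skip events that carry no font (images, for instance);
        # with borb you would use isinstance(event, ChunkOfTextRenderEvent)
        get_font = getattr(event, "get_font", None)
        if get_font is None:
            return
        name = get_font().get_font_name() or ""
        # substring match, so subset-prefixed names are caught as well
        if self._needle in name.lower():
            self.hits.append((event.text, event.box))


tracker = FontTracker("Wingdings")
tracker._event_occurred(FakeTextEvent("J", "ABCDEF+Wingdings-Regular", (72, 700, 12, 12)))
tracker._event_occurred(FakeTextEvent("Hello", "Arial", (90, 700, 40, 12)))
tracker._event_occurred(object())  # a non-text event is ignored
print(tracker.hits)  # [('J', (72, 700, 12, 12))]
```

    Unlike FontNameFilter, this sketch does not forward events to other listeners; it only collects what it needs for the later steps.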

    Determining what was written

    Fonts are complicated. The way the rendering instructions map to a given character/glyph inside a Font may differ from one PDF to another. You will probably need some custom logic to decide how to "decode" the Wingdings characters into legible text.
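    One possible shape for that custom logic is a lookup table, sketched below. Everything in it is an assumption to verify against your own document: the table is partial and purely illustrative, the replacement strings are arbitrary, and the names REPLACEMENTS, char_code and decode_wingdings are made up for this example. The sketch also assumes that extracted text may arrive either as plain character codes or shifted into the Private Use Area (U+F020 to U+F0FF), which is a common encoding for symbol fonts.

```python
from typing import Dict

# Partial, illustrative table: Wingdings character code -> replacement text.
# Check every entry against your own document before relying on it.
REPLACEMENTS: Dict[int, str] = {
    0x4A: ":)",       # 'J' is drawn as a smiling face in Wingdings
    0x4C: ":(",       # 'L' is drawn as a frowning face
    0xFC: "[check]",  # 0xFC is drawn as a check mark
}


def char_code(ch: str) -> int:
    """Map a character to its 8-bit font code, undoing a Private Use Area shift."""
    cp = ord(ch)
    if 0xF020 <= cp <= 0xF0FF:
        return cp - 0xF000
    return cp


def decode_wingdings(raw: str, table: Dict[int, str] = REPLACEMENTS, unknown: str = "?") -> str:
    """Translate extracted Wingdings text into legible replacement text."""
    return "".join(table.get(char_code(ch), unknown) for ch in raw)


print(decode_wingdings("J\uf04c"))  # :):(
```

    Characters missing from the table come out as "?", which makes gaps in the mapping easy to spot.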

    Removing content

    This is again "known terrain". You can simply add RedactAnnotation objects to the PDF. An Annotation can be thought of as "content that was added after creating the PDF".

    A RedactAnnotation represents "someone scribbling black marker over the document to erase part of its content".

    After adding a RedactAnnotation, you can "apply" it, which effectively erases the content underneath.
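    Before creating the annotations, it can help to merge the bounding boxes of neighbouring chunks and pad the result slightly, so that one redaction covers a whole Wingdings word. The sketch below is plain geometry on (x, y, width, height) tuples in PDF points and does not touch borb; union and grow are hypothetical helper names, and converting the result into the rectangle type the annotation expects is left to you.

```python
from decimal import Decimal
from typing import Iterable, Tuple

Box = Tuple[Decimal, Decimal, Decimal, Decimal]  # (x, y, width, height)


def union(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that contains all the given boxes."""
    items = list(boxes)
    if not items:
        raise ValueError("at least one box is required")
    left = min(b[0] for b in items)
    bottom = min(b[1] for b in items)
    right = max(b[0] + b[2] for b in items)
    top = max(b[1] + b[3] for b in items)
    return (left, bottom, right - left, top - bottom)


def grow(box: Box, margin: Decimal) -> Box:
    """Pad a box by `margin` on every side."""
    x, y, w, h = box
    return (x - margin, y - margin, w + 2 * margin, h + 2 * margin)


# two adjacent glyph boxes on the same line
glyphs = [
    (Decimal(72), Decimal(700), Decimal(12), Decimal(12)),
    (Decimal(84), Decimal(700), Decimal(12), Decimal(12)),
]
area = grow(union(glyphs), Decimal(2))
print(area)
```

    A small margin guards against glyphs that extend slightly beyond their reported bounding box, at the cost of possibly erasing a sliver of neighbouring content.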

    More on that in the example repository.

    Adding content to a PDF at a precise location

    Once you've identified the locations of the Wingdings characters, determined what you'd like to replace them with, and removed them, you can add the replacement content.

    In order to do so, you can borrow inspiration from the examples repository.

    #!chapter_002/src/snippet_018.py
    from decimal import Decimal
    
    from borb.pdf import Document
    from borb.pdf import PDF
    from borb.pdf import Page
    from borb.pdf import Paragraph
    from borb.pdf.canvas.geometry.rectangle import Rectangle
    
    
    def main():
        # create Document
        doc: Document = Document()
    
        # create Page
        page: Page = Page()
    
        # add Page to Document
        doc.add_page(page)
    
        # define layout rectangle
        # fmt: off
        r: Rectangle = Rectangle(
            Decimal(59),                # x: 0 + page_margin
            Decimal(842 - 84 - 100),    # y: page_height - page_margin - height_of_textbox
            Decimal(595 - 59 * 2),      # width: page_width - 2 * page_margin
            Decimal(100),               # height
        )
        # fmt: on
    
        # the next line of code uses absolute positioning
        Paragraph("Hello World!").paint(page, r)
    
        # store
        with open("output.pdf", "wb") as pdf_file_handle:
            PDF.dumps(pdf_file_handle, doc)
    
    
    if __name__ == "__main__":
        main()
    

    The crucial part here is:

    Paragraph("Hello World!").paint(page, r)
    

    This paints a Paragraph object inside the specified Rectangle on the given Page.

    Now it's a matter of putting it all together.