
Editing PDF metadata fields with Python3 and pdfrw


I'm trying to edit the Title metadata field of PDFs, replacing non-ASCII characters with their ASCII equivalents where possible. I'm using Python 3 and the pdfrw module.

How can I apply string operations to a metadata field and write the result back to the PDF?

My test code is here:

from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata

def edit_title_metadata(inpdf):

    trailer = PdfReader(inpdf)

    # this statement is breaking pdfrw
    trailer.Info.Title = unicode_normalize(trailer.Info.Title)

    # also have tried:
    #trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))

    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":

    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')

And the traceback is:

Traceback (most recent call last):
  File "get_metadata.py", line 68, in <module>
    main()
  File "get_metadata.py", line 54, in main
    edit_title_metadata(pdf)
  File "get_metadata.py", line 11, in edit_title_metadata
    trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
  File "get_metadata.py", line 18, in unicode_normalize
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
  File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
    if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types

Notes:

  • This issue at GitHub may be related.

  • FWIW, I also get the same error with Python 3.6.

  • I've shared the PDF, which contains non-ASCII hyphens (Unicode character \u2010):

 wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf

Solution

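  • Call .decode() on the metadata field first. trailer.Info.Title is a pdfrw PdfString, and the traceback shows that .encode() ends up in pdfrw's own pdfstring.py rather than in the built-in str.encode. .decode() returns an ordinary Python str that can be normalized and encoded as usual: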

    trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
    

    And full working code:

    from pdfrw import PdfReader, PdfWriter
    import unicodedata
    
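    def edit_title_metadata(inpdf):
        """Rewrite the Title field of inpdf as ASCII and save it to test.pdf."""
        trailer = PdfReader(inpdf)
        # decode() turns the pdfrw PdfString into a plain Python str
        trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
        PdfWriter("test.pdf", trailer=trailer).write()
    
    def unicode_normalize(s):
        # Decompose (NFKD), then drop whatever cannot be encoded as ASCII
        return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')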
    
    if __name__ == "__main__":
    
        edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
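
One caveat about the normalization step, sketched here with the standard library only. NFKD splits accented letters into a base letter plus combining marks, so the base letter survives the ASCII encoding. The hyphen U+2010 mentioned in the question has no such decomposition, so encode('ascii', 'ignore') removes it instead of turning it into "-". The names to_ascii, to_ascii_keep_hyphens and HYPHEN_MAP below are hypothetical, and the mapping is only an assumption about which characters should become a plain hyphen. Unlike unicode_normalize above, these helpers return str rather than bytes.

```python
import unicodedata

# Hypothetical mapping: Unicode hyphen/dash code points that should
# become an ASCII hyphen. Extend or shrink it to suit your data.
HYPHEN_MAP = {0x2010: "-", 0x2011: "-", 0x2013: "-", 0x2014: "-"}


def to_ascii(text):
    """NFKD-decompose text, drop non-ASCII characters, return a str."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_ascii_keep_hyphens(text):
    """Like to_ascii, but replace known Unicode hyphens with '-' first."""
    return to_ascii(text.translate(HYPHEN_MAP))


if __name__ == "__main__":
    print(to_ascii("Anad\u00f3n"))                           # Anadon
    print(to_ascii("safety\u2010evaluation"))                # safetyevaluation
    print(to_ascii_keep_hyphens("safety\u2010evaluation"))   # safety-evaluation
```

If keeping the hyphens matters, the decoded title could be passed through to_ascii_keep_hyphens(trailer.Info.Title.decode()) in place of unicode_normalize.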