
Warnings on pdfminer


I found this script on Stack Overflow and modified it slightly so that it works on Python 3.3:

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    string = retstr.getvalue()
    retstr.close()
    return string


print(convert_pdf('abc.pdf'))

It works fine; however, I seem to be having two issues:

  • While running the script I get tons of warnings:

    WARNING:root:undefined: PDFCIDFont: basefont='LKOELN+Wingdings-Regular', cidcoding='Adobe-Identity', 139
    WARNING:root:undefined: PDFCIDFont: basefont='LKKPCF+Wingdings2', cidcoding='Adobe-Identity', 132

In the printed text these show up as (cid:139). How do I catch these warnings and replace that text with something else?

  • Note the codec line: in the original script the codec is passed to the TextConverter(rsrcmgr, retstr, laparams=laparams) call, but when I do that I get:

    Traceback (most recent call last):
      File "C:/Users/rodrigo/Desktop/csp_pdf/csp_pdf2.py", line 46, in <module>
        convert_pdf('abc.pdf')
      File "C:/Users/rodrigo/Desktop/csp_pdf/csp_pdf2.py", line 33, in convert_pdf
        device = TextConverter(rsrcmgr, retstr, codec = 'utf-8', laparams=laparams)
    TypeError: __init__() got an unexpected keyword argument 'codec'

Is this related to the first issue?

Thanks!


Solution

  • Unfortunately, pdfminer3k logs to the Python root logger rather than to a logger of its own (IMHO PDFMiner should implement logging properly). So it is not possible to quiet it in the usual way, by logger name, like this:

    logging.getLogger("pdfminer").setLevel(logging.WARNING)
    

    Bummer!

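    I did this and it works™:

        # "myapp" stands for the name of your own logger
        logging.getLogger("myapp").propagate = False
        logging.getLogger().setLevel(logging.ERROR)

    The second line sets the root logger to level ERROR. That stops PDFMiner's warnings, since they are logged to the root logger. Your own logging keeps working as long as your loggers set their own level; a logger without a level inherits the root one.

    I needed to turn propagation off because, after using PDFMiner, I had duplicate log entries. They came from the root logger: records from my own logger were also passed up to it. Note that propagate is an attribute of a logger object, so it has to be set on your own logger; assigning logging.propagate on the module itself has no effect.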
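For anyone who wants to try this without a PDF at hand, here is a minimal sketch of the behaviour described above, plus one possible way to deal with the (cid:139) markers from the question. It does not import pdfminer at all: noisy_library_call is a hypothetical stand-in for library code that logs to the root logger, "myapp" is a placeholder logger name, and replace_cid_placeholders is a helper made up for this example. It assumes the markers always have the form (cid:N).

```python
import io
import logging
import re

# Capture root-logger output in memory so the effect can be inspected.
root_stream = io.StringIO()
root = logging.getLogger()
root_handler = logging.StreamHandler(root_stream)
root_handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
root.addHandler(root_handler)


def noisy_library_call():
    # Hypothetical stand-in for library code that logs through the
    # module-level helper, which always targets the root logger.
    logging.warning("undefined: example font warning")


# Application logger ("myapp" is a placeholder name) with its own level and
# handler; propagate=False keeps its records away from the root handlers.
app_stream = io.StringIO()
log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler(app_stream))
log.propagate = False

# 1. A level set on a logger named "pdfminer" does not filter the call,
#    because the record never passes through that logger.
root.setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.CRITICAL)
noisy_library_call()
before = root_stream.getvalue()

# 2. Raising the level of the root logger does filter it.
root.setLevel(logging.ERROR)
noisy_library_call()
after = root_stream.getvalue()

# The application logger is unaffected and writes the record exactly once.
log.info("conversion finished")

# The "(cid:139)" markers are ordinary text in the extracted string, so a
# regular expression can replace them after extraction.
CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def replace_cid_placeholders(text, replacement="?"):
    # A function is passed to sub() so the replacement is used literally.
    return CID_PLACEHOLDER.sub(lambda match: replacement, text)


cleaned = replace_cid_placeholders("Agenda (cid:139) item (cid:132) done", "*")
print(cleaned)  # Agenda * item * done
```

After step 1, root_stream contains the WARNING:root:... line even though the "pdfminer" logger is set to CRITICAL. After step 2 nothing new is added. In the script from the question you would apply replace_cid_placeholders to the string returned by convert_pdf.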