How do I use pdfminer as a library?


I am trying to extract text from a PDF using pdfminer. I can already extract it to a .txt file with the pdfminer command-line tool pdf2txt.py, and I then use a Python script to clean up the .txt file. I would like to incorporate the PDF extraction into that script and save myself a step.

I thought I was on to something when I found this link, but I didn't have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.

I also tried the function shown here, but it also did not work.

Another approach I tried was to call pdf2txt.py from within my script using os.system. This was also unsuccessful.

I am using Python version 2.7.1 and pdfminer version 20110227.


Solution

  • Here is a cleaned-up version that finally worked for me. It simply returns the text of a PDF as a string, given its filename. I hope this saves someone time.

    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    
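    def convert_pdf(path):
        """Return the text of the PDF at `path` as a UTF-8 encoded string."""
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    
        # Parse the file and run every page through the text converter.
        fp = open(path, 'rb')
        process_pdf(rsrcmgr, device, fp)
        fp.close()
        device.close()
    
        # Collect everything the converter wrote to the in-memory buffer.
        text = retstr.getvalue()
        retstr.close()
        return text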
    

    This solution was valid until the pdfminer API changes in November 2013.
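For releases after that change, the same idea can probably be expressed by feeding pages through `PDFPageInterpreter` instead of calling `process_pdf`. The following is only a rough sketch under several assumptions that are not in the original answer: Python 3, and a pdfminer release (or the pdfminer.six fork) that provides `pdfminer.pdfpage.PDFPage`. It has not been checked against pdfminer 20110227.

```python
from io import StringIO


def convert_pdf(path):
    """Return the text of the PDF at `path` as a str (sketch for the newer API)."""
    # Imported inside the function so that a missing or older pdfminer
    # install surfaces as an ImportError at call time.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Feed each page through the interpreter; the converter appends
    # the extracted text to the in-memory buffer.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
```

A cleanup script could then call `text = convert_pdf('input.pdf')` (the filename is a placeholder) and process the result directly, with no intermediate .txt file. With the pdfminer.six fork, `pdfminer.high_level.extract_text(path)` appears to wrap the same steps in a single call, which may be simpler if that fork is an option.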