How to get this Python method to return a string instead of writing it to stdout?


I'm trying to extract text from a PDF using Python. For this I found pdfminer, which does a fairly good job when run through its pdf2txt.py command-line tool:

kramer65 $ pdf2txt.py myfile.pdf
all the text contents
of the pdf
are printed out here..

Because I want to use this functionality in my program, I want to use it as a module rather than as a command-line tool. So I adjusted the pdf2txt.py file to the following:

#!/usr/bin/env python
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams

def main(fp):
    debug = 0
    pagenos = set()
    maxpages = 0
    imagewriter = None
    codec = 'utf-8'
    caching = True
    laparams = LAParams()

    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug

    resourceManager = PDFResourceManager(caching=caching)
    outfp = sys.stdout
    device = TextConverter(resourceManager, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    interpreter = PDFPageInterpreter(resourceManager, device)
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  # Here I want to return the extracted text string

I can now call it as a module as follows:

>>> from my_pdf2txt import main
>>> main(open('myfile.pdf', 'rb'))
all the text contents
of the pdf
are printed out here..

It currently writes the resulting text using sys.stdout.write(), but I want it to return that text from the return statement on the last line of my code. Because the use of sys.stdout.write is hidden deep on lines 165-167 of converter.py, I don't know how to make this function return the text instead of writing it to stdout.

Does anybody know how I could get this method to return the found strings instead of writing them to stdout? All tips are welcome!


Solution

  • As suggested by Darth Kotik, you can point sys.stdout to whatever file-like object you want. Then when you call a function, the printed data will be directed to your object, rather than the screen. Example:

    import io
    import sys

    def frob():
        sys.stdout.write("Hello, how are you doing?")


    # We want to call frob() and store its output in a temporary buffer.

    # Hold on to the old reference to stdout so we can restore it later.
    old_stdout = sys.stdout

    # Create a temporary buffer object and assign it to stdout.
    output_buffer = io.StringIO()
    sys.stdout = output_buffer

    try:
        frob()
        # Retrieve the result.
        result = output_buffer.getvalue()
    finally:
        # Restore the old value of stdout, even if frob() raised.
        sys.stdout = old_stdout

    print("This is the result of frob: ", result)
    

    Output:

    This is the result of frob:  Hello, how are you doing?
    

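    For your problem, replace the frob() call with main(fp). One caveat: main() as posted ends with outfp.close(), and while stdout is redirected that call closes your buffer. Calling getvalue() on a closed StringIO raises ValueError, so remove that line from main() first.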
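
On Python 3.4 or later, the same save, swap, and restore steps can be wrapped with contextlib.redirect_stdout from the standard library. The following is a minimal sketch rather than the only way to do it. The helper name capture_stdout is hypothetical (it is not part of pdfminer), and frob() again stands in for the function whose output should be captured.

```python
import contextlib
import io
import sys


def capture_stdout(func, *args, **kwargs):
    """Call func and return everything it wrote to sys.stdout as a string.

    capture_stdout is a hypothetical helper name used for illustration.
    """
    buffer = io.StringIO()
    # redirect_stdout replaces sys.stdout with the buffer and puts the
    # original stream back on exit, even if func raises an exception.
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def frob():
    # Stand-in for a function that writes to stdout instead of returning.
    sys.stdout.write("Hello, how are you doing?")


result = capture_stdout(frob)
```

With this helper, the call for the question's module would look like text = capture_stdout(main, fp), assuming main() leaves the output stream open instead of closing it.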