I'm trying to extract text from a PDF using Python. For this I found pdfminer, which does a fairly good job when I use its pdf2txt.py command-line tool as follows:
kramer65 $ pdf2txt.py myfile.pdf
all the text contents
of the pdf
are printed out here..
Because I want to use this functionality in my program, I want to use this as a module rather than a command line tool. So I managed to adjust the pdf2txt.py file to the following:
#!/usr/bin/env python
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams

def main(fp):
    debug = 0
    pagenos = set()
    maxpages = 0
    imagewriter = None
    codec = 'utf-8'
    caching = True
    laparams = LAParams()

    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug

    resourceManager = PDFResourceManager(caching=caching)
    outfp = sys.stdout
    device = TextConverter(resourceManager, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    interpreter = PDFPageInterpreter(resourceManager, device)
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  # Here I want to return the extracted text string
I can now call it as a module as follows:
>>> from my_pdf2txt import main
>>> main(open('myfile.pdf', 'rb'))
all the text contents
of the pdf
are printed out here..
It currently prints the resulting strings using sys.stdout.write(), but I actually want it to return those strings using the return statement on the last line of my code. But since the call to sys.stdout.write is hidden deep on lines 165-167 in converter.py, I don't really know how to get this method to return those strings instead of writing them to stdout.
Does anybody know how I could get this method to return the found strings instead of writing them to stdout? All tips are welcome!
As suggested by Darth Kotik, you can point sys.stdout to whatever file-like object you want. Then when you call a function, the printed data will be directed to your object, rather than the screen. Example:
import sys
import StringIO

def frob():
    sys.stdout.write("Hello, how are you doing?")

# We want to call frob, storing its output in a temporary buffer.

# Hold on to the old reference to stdout so we can restore it later.
old_stdout = sys.stdout

# Create a temporary buffer object, and assign it to stdout.
output_buffer = StringIO.StringIO()
sys.stdout = output_buffer

frob()

# Retrieve the result.
result = output_buffer.getvalue()

# Restore the old value of stdout.
sys.stdout = old_stdout

print "This is the result of frob:", result
Output:
This is the result of frob: Hello, how are you doing?
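The example above is Python 2 (the StringIO module and the print statement). On Python 3 the same idea can be sketched roughly as below, assuming the standard library's contextlib.redirect_stdout and io.StringIO. The helper name capture_stdout is made up for illustration, it is not part of any library.

```python
import contextlib
import io
import sys


def frob():
    sys.stdout.write("Hello, how are you doing?")


def capture_stdout(func, *args, **kwargs):
    """Hypothetical helper: call func and return what it wrote to sys.stdout."""
    buffer = io.StringIO()
    # redirect_stdout swaps sys.stdout for the buffer and puts the original
    # back when the block exits, even if func raises an exception.
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


result = capture_stdout(frob)
print("This is the result of frob:", result)
```

The context manager does the save-and-restore bookkeeping that the manual version handles with old_stdout, so stdout is restored even when the called function fails.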
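For your problem, you would just replace the frob() call with main(fp). Note that main ends by calling outfp.close(), and at that point outfp is your buffer, so remove that line from main (or read the value before it runs); otherwise getvalue() will raise a ValueError because the buffer is already closed.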
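If you would rather not touch sys.stdout at all, another option is to hand the buffer straight to TextConverter as its output file and return the buffer's contents. Below is a minimal sketch of that, assuming Python 3 and the pdfminer.six fork, where TextConverter accepts a text stream such as io.StringIO. The function name pdf_to_text is just an illustrative choice, and I have not run this against the exact pdfminer version from the question.

```python
import io


def pdf_to_text(fp):
    """Return the text of an open PDF file (binary mode) as a string.

    Sketch only: assumes pdfminer.six is installed. The imports live inside
    the function so this module still loads when pdfminer is missing.
    """
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    buffer = io.StringIO()  # replaces sys.stdout as the converter's output
    device = TextConverter(resource_manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)
        # Read the text before anything closes the buffer.
        return buffer.getvalue()
    finally:
        device.close()
        buffer.close()
```

Usage would then look like this, with the caller owning the file handle:

    with open('myfile.pdf', 'rb') as fp:
        text = pdf_to_text(fp)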