I'm looking to export text from a PDF as a list of strings, where the list is the whole document and each string is one page of the PDF. I'm using PDFMiner for this task, but it is very complicated and I'm on a tight deadline.
So far I've gotten the code to extract the full PDF as a single string, but I need it as a list of strings.
My code is as follows:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
pdf = file('./PDF/' + file_name, 'rb')
data = []
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.get_pages(pdf):
    interpreter.process_page(page)
    data = retstr.getvalue()
print data
Help, please.
The issue with your current script is that StringIO.getvalue always returns a single string containing all the data written so far. Moreover, with each page, you're overwriting the variable data where you're storing it.
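To see the accumulation without a PDF at hand, here is a minimal sketch. It assumes Python 3's io.StringIO rather than cStringIO, and the write call is only a stand-in for interpreter.process_page(page); the names buffer, snapshots and fake_page_text are made up for illustration.

```python
from io import StringIO

buffer = StringIO()
snapshots = []

# Each write stands in for interpreter.process_page(page) adding one page of text.
for fake_page_text in ["page one\n", "page two\n"]:
    buffer.write(fake_page_text)
    # getvalue() returns everything written so far, not just the latest page.
    snapshots.append(buffer.getvalue())
```

The second snapshot holds both pages, which is why the last value of data in the question ends up being the whole document.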
One fix is to store the position of the StringIO stream before each page is written, and then read from that position to the end of the stream:
# A list holding each page's text
pages_text = []
for page in PDFPage.get_pages(pdf):
    # Get (and store) the "cursor" position of the stream before reading from the PDF
    # On the first page, this will be zero
    read_position = retstr.tell()
    # Read PDF page, write text into stream
    interpreter.process_page(page)
    # Move the "cursor" back to the stored position
    retstr.seek(read_position, 0)
    # Read the text (from the "cursor" to the end)
    page_text = retstr.read()
    # Add this page's text to a convenient list
    pages_text.append(page_text)
Think of StringIO as a text document. You need to manage the cursor position as text is added and store the newly added text one page at a time. Here, we're storing text in a list.
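The cursor bookkeeping can be tried on its own, without PDFMiner. The following is only a sketch under a few assumptions: it uses Python 3's io.StringIO, the helper name split_by_writes is hypothetical, and stream.write(chunk) stands in for interpreter.process_page(page).

```python
from io import StringIO


def split_by_writes(chunks):
    """Write each chunk to one shared stream and recover it separately."""
    stream = StringIO()
    pages_text = []
    for chunk in chunks:
        # Remember where the cursor is before this "page" is written.
        start = stream.tell()
        # Stand-in for interpreter.process_page(page).
        stream.write(chunk)
        # Jump back to the remembered position and read to the end.
        stream.seek(start)
        pages_text.append(stream.read())
        # read() leaves the cursor at the end, so the next write appends.
    return pages_text


pages = split_by_writes(["first page\n", "second page\n", ""])
```

A page that produces no text simply yields an empty string, so the list still has one entry per page.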