Here's what I want to do: a program that takes a list of PDF files as its input and returns one .txt file for each file in the list.
For example, given a listA = ["file1.pdf", "file2.pdf", "file3.pdf"], I want Python to create three txt files (one for each pdf file), say "file1.txt", "file2.txt" and "file3.txt".
I have the conversion part working smoothly thanks to this guy. The only change I've made is to the maxpages statement, where I assigned 1 instead of 0 in order to extract only the first page. As I said, this part of my code is working perfectly. Here's the code:
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    #maxpages = 0
    maxpages = 1
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
What I cannot get Python to do is produce the output I described in the second paragraph. I've tried the following code:
def save(lst):
    i = 0
    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lst[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
    i += 1
I ran that save function with a list of two pdf files as the input, but it generated only one txt file and kept running for several minutes without generating the second txt file. What's a better approach to fulfill my goals?
You never update i inside the loop, so your code gets stuck in an infinite loop. You need i += 1 inside the while body:
def save(lst):
    i = 0 # set to 0 but never changes
    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1 # you need to increment i
A better option would be to simply use range:
def save(lst):
    for i in range(len(lst)):
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
You also only ever use lista[0], so every output file would contain the text of the first PDF. You may want to change that code to move across the list on each iteration. If lst is actually lista, you can use enumerate:
def save(lst):
    for i, ele in enumerate(lst):
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(ele)
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
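If you would rather have the output names follow the input names, as in your "file1.pdf" to "file1.txt" example, something along these lines might work. This is only a sketch: txt_name_for and save_all are names I made up for illustration, and the convert argument is assumed to be your convert_pdf_to_txt (or any function that takes a path and returns the text as a string).

```python
import os


def txt_name_for(pdf_path):
    """Return the path of the .txt file matching a .pdf path (hypothetical helper)."""
    base, _ext = os.path.splitext(pdf_path)
    return base + ".txt"


def save_all(pdf_paths, convert):
    """Convert every PDF in pdf_paths and write one .txt file per PDF.

    `convert` is assumed to be a function such as convert_pdf_to_txt
    that takes a path and returns the extracted text.
    """
    written = []
    for pdf_path in pdf_paths:
        txt_path = txt_name_for(pdf_path)
        text = convert(pdf_path)  # convert the current file, not always the first one
        with open(txt_path, "w") as textfile:
            textfile.write(text)
        written.append(txt_path)
    return written
```

You would then call it as save_all(listA, convert_pdf_to_txt). Passing the converter in as an argument is just a convenience so the loop can be tried out with a dummy function before running it on real PDFs.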