I want to gather the text from thousands of PDF files with Python. The text extraction itself works fine, but my code stops at a random point (not the same PDF each time) during execution with this error:
http.client.RemoteDisconnected: Remote end closed connection without response
I'm using urllib. I want to know how I can avoid this error, and if I can't, how to catch it (even a bare except: does not work).
The code I used:
df = pd.read_csv(csv_path, sep=";", error_bad_lines=False)
for i, row in df.iterrows():
    print(row['year'], "- adding ", row['title'])
    request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
    try:
        row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
    except TypeError:
        row['fullarticle'] = ""
    os.remove("_tmp.pdf")
print("Done. Saving csv...")
df.to_csv("my_structured_articles.csv")
print("Done. Head(10) : ")
print(df.head(10))
return df
You need to put the try/except block around the urlretrieve call, which is what actually raises the error. Your current try block only covers convert_pdf_to_txt, so the exception from the download escapes it (that is why it seemed like except did not work):
for i, row in df.iterrows():
    print(row['year'], "- adding ", row['title'])
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
    except http.client.RemoteDisconnected:  # requires "import http.client"
        continue  # this will skip the URL throwing the error
You can find the documentation for the exception in the http.client module docs.
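Since the disconnect happens at a different PDF each run, the server is probably just flaky, so retrying before giving up may recover more articles than skipping outright. Here is a minimal sketch of that idea; `download_with_retries` is a hypothetical helper name, and the attempt count is an arbitrary choice:

```python
import http.client
from urllib import request

def download_with_retries(url, filename, attempts=3):
    """Try to download url to filename; retry when the remote end
    closes the connection, and report failure instead of raising."""
    for attempt in range(attempts):
        try:
            request.urlretrieve(url, filename)
            return True  # download succeeded
        except http.client.RemoteDisconnected:
            # the server dropped the connection; try again
            print(f"Connection dropped (attempt {attempt + 1}/{attempts})")
    return False  # all attempts failed; caller can skip this row
```

In the loop you would then write something like `if not download_with_retries(row['pdfarticle'], "_tmp.pdf"): continue`, which keeps the skip-on-failure behavior of the answer above but only after a few retries.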