Download PDFs: Remote end closed connection without response


I want to gather the text from thousands of PDF files with Python. The text extraction itself works fine, but my code stops at random points during execution (not at the same PDF each time) with this error:

http.client.RemoteDisconnected: Remote end closed connection without response

I'm using urllib. How can I avoid this error, and if I can't avoid it, how can I catch it? (Even a bare except: does not work.)

The code I used:

df = pd.read_csv(csv_path, sep=";", error_bad_lines=False)

for i,row in df.iterrows():
    print(row['year'], "- adding ",row['title'])
    request.urlretrieve(row['pdfarticle'],"_tmp.pdf")
    try:
        row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
    except TypeError:
        row['fullarticle'] = ""
        pass

os.remove("_tmp.pdf")
print("Done. Saving csv...")
df.to_csv("my_structured_articles.csv")
print("Done. Head(10) : ")
print(df.head(10))
return df

Solution

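  • The urlretrieve call has to go inside the try block. In your code it runs before the try, so no except clause, not even a bare except:, can catch the exception it raises: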

    import http.client

    for i,row in df.iterrows():
        print(row['year'], "- adding ",row['title'])
        try:
            request.urlretrieve(row['pdfarticle'],"_tmp.pdf")
        except http.client.RemoteDisconnected:
            continue # skip this URL and move on to the next row
        # ... then run the existing convert_pdf_to_txt step on "_tmp.pdf"


    The exception is documented in the http.client section of the Python standard library documentation.
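
Since the failure is intermittent, skipping the URL on the first error may drop PDFs that would download fine on a second attempt. Below is a minimal sketch of a retry wrapper, not a definitive implementation. The names download_with_retry, retries, delay and fetch are hypothetical, and the fetch parameter exists only so the download call can be swapped out. It also catches urllib.error.URLError on the assumption that other network failures should be treated the same way.

```python
import http.client
import time
import urllib.error
from urllib import request


def download_with_retry(url, dest, retries=3, delay=1.0, fetch=request.urlretrieve):
    """Download url to dest; return True on success, False if every attempt fails.

    All names here are hypothetical and chosen for this sketch only.
    """
    for attempt in range(1, retries + 1):
        try:
            fetch(url, dest)
            return True
        except (http.client.RemoteDisconnected, urllib.error.URLError) as exc:
            print(f"attempt {attempt}/{retries} failed for {url}: {exc!r}")
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * attempt)
    return False
```

In the loop, the urlretrieve line and its try/except could then be replaced by a check on the return value: call download_with_retry(row['pdfarticle'], "_tmp.pdf") and continue to the next row when it returns False.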