I am teaching myself web scraping and wanted to download a bunch of .pgn files (essentially text files) using requests. The filenames are based on dates but are not strictly chronological. I ran a loop over possible dates, but if an indexed date doesn't correspond to an actual file, I still end up downloading filename.pgn as a text file containing the HTML of the error page. Instead, I want those dates to be skipped.
Here's an example:
If I run:
filename = 'games9jul18.pgn'
url = 'https://www.chesspublishing.com/p/9/jul18/' + filename
response = requests.post(url, data=payload)
with open(filename, 'wb') as e:
    e.write(response.content)
with the appropriate authentication in payload, the correct file games9jul18.pgn is saved. But if I run:
filename = 'games9aug18.pgn'
url = 'https://www.chesspublishing.com/p/9/aug18/' + filename
response = requests.post(url, data=payload)
with open(filename, 'wb') as e:
    e.write(response.content)
I still get a saved file games9aug18.pgn, but instead of being a 'real' .pgn file, it's a text file of the HTML of the error page. Navigating to the error page in my browser, there is no error code, just a big chunk of text: "The page you've asked may have been removed, or perhaps never existed."
Unfortunately, it's not possible to loop only over the filenames corresponding to actual files, due to the inconsistent date structure. How can I add a condition so that no .pgn file is created when the error page is reached?
You should check the response status code. "Page not found" is 404, so you could check for that code, or simply check for a successful request, which is 200:
response = requests.post(url, data=payload)
if response.status_code == 200:
    with open(filename, 'wb') as e:
        e.write(response.content)
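Since you mention the error page shows no error code in your browser, it's possible the server answers 200 even for missing files. In that case you can also inspect the response itself before saving. Here is a minimal sketch of the full loop, assuming payload holds your authentication data and dates is a list of date strings you want to try; the Content-Type check and the error-page phrase are assumptions you may need to adjust for this site:

import requests

dates = ['jul18', 'aug18', 'sep18']  # hypothetical date strings to try

for date in dates:
    filename = 'games9' + date + '.pgn'
    url = 'https://www.chesspublishing.com/p/9/' + date + '/' + filename
    response = requests.post(url, data=payload)

    # Skip anything that isn't a successful response.
    if response.status_code != 200:
        continue

    # Assumption: if the server returns 200 even for missing files, the error
    # page will be HTML, so skip responses that look like HTML or that contain
    # the error text you saw in the browser.
    if ('text/html' in response.headers.get('Content-Type', '')
            or b"The page you've asked may have been removed" in response.content):
        continue

    with open(filename, 'wb') as f:
        f.write(response.content)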