I'm trying to collect all the URLs in the range whose pages contain either the phrase "Recipe adapted from" or "Recipe from". The script writes links to the file until around entry 7496, then it raises HTTPError 404. What am I doing wrong? I've also tried BeautifulSoup and requests, but I still can't get it to work.
import urllib2

with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        page_content = urllib2.urlopen(url).read()
        if "Recipe adapted from" in page_content:
            print url
            f.write(url + '\n')
        elif "Recipe from" in page_content:
            print url
            f.write(url + '\n')
        else:
            pass
Some of the URLs you are trying to scrape do not exist. Simply skip them by catching and ignoring the exception:
import urllib2

with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        try:
            page_content = urllib2.urlopen(url).read()
        except urllib2.HTTPError as error:
            if 400 < error.code < 500:
                continue  # not found, unauthorized, etc.
            raise  # other errors we want to know about
        if "Recipe adapted from" in page_content or "Recipe from" in page_content:
            print url
            f.write(url + '\n')
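Since you mentioned trying other tools: here is a rough Python 3 sketch of the same skip-on-4xx logic using the standard-library urllib.request (the URL pattern, count, and phrases are taken from your script; the function names and the injectable `get` parameter are my own, added so the matching logic can be tested without hitting the network).

```python
import urllib.request
import urllib.error

PHRASES = ("Recipe adapted from", "Recipe from")

def is_client_error(status):
    """True for 4xx responses (404 Not Found, 403 Forbidden, ...)."""
    return 400 <= status < 500

def fetch(url):
    """Return a page's text, or None for a 4xx response (page missing)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        if is_client_error(error.code):
            return None  # entry does not exist: skip it
        raise  # 5xx and other errors are worth surfacing

def collect_recipe_urls(count, get=fetch):
    """Yield each entry URL whose page mentions one of PHRASES."""
    for i in range(count):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        page = get(url)
        if page is not None and any(phrase in page for phrase in PHRASES):
            yield url
```

Passing the fetcher in as a parameter lets you swap in requests later (or a dict of canned pages for testing) without touching the loop.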