I want to pickle HTML from websites. I save the HTML to a list and try to pickle it. One example of such a list is the HTML from brckhmptn.com/tour. That page has a lot of HTML; is that what causes my error? The whole script is below, but the error is raised in the last few lines. I'm using Python 3.6.1.
Traceback (most recent call last):
File "./showNotifier.py", line 128, in <module>
File "./showNotifier.py", line 125, in scrape_html
pickle.dump(sites, hf)
RecursionError: maximum recursion depth exceeded
By Stephen Wist
script takes cmd line args to:
indicate URLs to add
default behaviour is checking if new shows were added
import requests
import pickle
import sys
import argparse
import os
import urllib
from bs4 import BeautifulSoup
urlFile = "urls"
htmlFile = "htmls"
# take in cmd line args
parseArgs = argparse.ArgumentParser(description='Add URLs to cache.')
parseArgs.add_argument('urls', type=str, nargs='*', help='URLs to be added.')
# assign cmd line args to urls.urls
urls = parseArgs.parse_args()
# this function makes sure all files are in place
def status_report():
    # this should be the case only the first time the
    # script is run
    if (os.path.getsize(urlFile) == 0):
        urlFileExists = 0
        # create urlFile if it does not exist
        if (not urls.urls):
            print ("No data in the url file. Run the script again but include url(s) on the command line.\n\
e.g. ./showNotifier.py www.myfavoriteband.com")
    else:
        urlFileExists = 1
    # these files should never be deleted, but just in case
    if (not os.path.isfile(urlFile)):
        f = open("urls","w")
    if (not os.path.isfile(htmlFile)):
        f = open("htmls","w")
    return urlFileExists
urlFileExists = status_report()
# grab the urls in urlFile, or make
# urlFile if it does not exist
def read_urls(urlFileExists):
    # assign all urls in urlFile to prevUrls
    if (urlFileExists == 1):
        uf = open(urlFile, "rb")
        prevUrls = pickle.load(uf)
        return prevUrls
    return 1
prevUrls = read_urls(urlFileExists)
print("prevUrls: {}\n".format(prevUrls))
# we will need to check if the user has
# entered a url that is already stored
# and ignore it so the contents of the stored
# urls must be known
def compare_urls(prevUrls, newUrls):
    # no urls were stored in urlFile,
    # so just move on with the script
    if (prevUrls == 1):
        return newUrls
    # iterate over all urls given on cmd line
    # check for membership in the set of
    # stored urls and remove them if the
    # test is true
    for url in newUrls:
        if (url in prevUrls):
            print ("duplicate url {} found, ignoring it.\n".format(url))
    combinedUrls = newUrls + prevUrls
    return combinedUrls
combinedUrls = compare_urls(prevUrls, urls.urls)
print("combinedUrls: {}\n".format(combinedUrls))
print("combo urls[0]: {}\n".format(combinedUrls[0]))
# write all unique urls to file
def write_urls(combinedUrls):
    uf = open(urlFile, "wb")
    pickle.dump(combinedUrls, uf)
    return 0
# visit sites, store their HTML in a list (for now)
def scrape_html(combinedUrls):
    sites = []
    # could this loop be shortened to a fancy list comprehension
    # or lambda expression?
    for url in combinedUrls:
        response = urllib.request.urlopen(url)
        html = response.read()
        soup = BeautifulSoup(html, "html.parser")
        sites.append(soup)
    hf = open(htmlFile, "wb")
    pickle.dump(sites, hf)
    return 0
import sys
sys.setrecursionlimit(10000)
10,000 recursions should be enough. What happened is that somewhere, something is calling itself over and over again, and each nested call counts as one level of recursion. Python caps the recursion depth so that runaway recursion cannot crash the interpreter. Hitting the cap is usually a sign of a bug, but you can raise the limit as you see fit if your program legitimately recurses unusually deeply.
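If you would rather not raise the limit for the whole program, one option is to raise it only around the call that needs it and restore it afterwards. The following is a minimal sketch, not part of the original script: `recursion_limit` and `count_down` are hypothetical names made up for illustration, and `count_down` simply stands in for any deeply recursive call.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily set the recursion limit, restoring the old value on exit."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        # Restore the previous limit even if the body raised an exception.
        sys.setrecursionlimit(previous)


def count_down(n):
    """Hypothetical stand-in for deep recursion: recurses n levels deep."""
    if n == 0:
        return 0
    return 1 + count_down(n - 1)


# 3000 nested calls exceed the usual default limit of 1000,
# but fit comfortably under the temporary limit of 10000.
with recursion_limit(10000):
    print(count_down(3000))
```

In the script from the question, the same pattern would wrap the `pickle.dump(sites, hf)` call. Whether 10,000 is enough depends on how deeply the object being pickled is nested, so treat the number as a starting point rather than a guarantee.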