Tags: python, beautifulsoup, web-crawler, urllib3

Can't see infinite loop


I am trying to write a web crawler, but I am stuck because I cannot find the infinite loop somewhere in my code.

class Crawler(object):
    def __init__(self, url, query, dir=os.path.dirname(__file__)):
        self.start_url = url
        self.start_parsed = urllib3.util.parse_url(url)
        self.query = re.compile(query, re.IGNORECASE)
        self.dir = dir
        self.__horizon = set()
        self.log = []

        self.__horizon.add(url)
        self.log.append(url)
        print("initializing crawler....")
        print(locals())

    def start(self, depth=5, url='/'):
        print(url, depth)
        self.log.append(url)
        if depth > 0:
            pool = urllib3.PoolManager()
            data = pool.request("GET", self.start_url if url == '/' else url).data.decode('utf-8')

            valid_list = []
            self.add_horizon(parser_soup.get_links(data), valid_list)

            if re.search(self.query, parser_soup.get_text(data)):
                self.output(data)

            for u in valid_list:
                self.start(depth = (depth-1), url = u)

    def output(self, data):
        with open(os.path.join(self.dir, get_top_domain(self.start_parsed.host) + '.' + str(time.time()) + '.html'), 'w+') as f:
            f.write(data)

    def add_horizon(self, url_list, valid_list=[]):
        for url in url_list:
            if get_top_domain(url) == get_top_domain(self.start_parsed.host)  \
                    and (not str(url) in self.log or not str(url) in self.__horizon):
                valid_list.append(str(url))

        self.__horizon.update(valid_list)

It runs forever. How should I ensure that I eliminate duplicate links?


Solution

  • Adapted from Giorgian's code:

    class Crawler(object):
        def __init__(self, url, query, dir=os.path.dirname(__file__)):
            self.visited = set()
            # Rest of code...
    
        def start(self, depth=5, url='/'):
            if url in self.visited:
                return True
            self.visited.add(url)
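The likely root cause in the original `add_horizon` filter is the boolean operator: `not str(url) in self.log or not str(url) in self.__horizon` passes any url that is missing from *either* collection, so a url that is already queued in the horizon but not yet crawled (and therefore not yet in the log) is queued again and again. The condition should use `and` (equivalently, skip a url that is in either collection). A minimal sketch of the difference, with illustrative stand-in data:

```python
# Hypothetical stand-ins for the asker's attributes.
log = ["http://example.com/a"]        # pages already crawled
horizon = {"http://example.com/b"}    # pages queued but not yet crawled

def is_new_buggy(url):
    # The original condition: True if the url is missing from EITHER
    # collection, so a queued-but-uncrawled url slips through repeatedly.
    return not url in log or not url in horizon

def is_new_fixed(url):
    # Accept only urls seen in NEITHER collection.
    return url not in log and url not in horizon

print(is_new_buggy("http://example.com/b"))  # True  -- duplicate slips through
print(is_new_fixed("http://example.com/b"))  # False -- duplicate filtered out
```

Checking a `visited` set at the top of `start()`, as above, fixes the same problem in one place regardless of how the filter is written.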
    

    A defaultdict is a dictionary that falls back to a default value when a key is missing. That, however, is the wrong tool here: a set is more memory-efficient and more elegant, as shown in the code above.

    Membership tests on a set take O(1) time on average, just as fast as @Giorgian's answer.
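The O(1) claim is easy to verify: `in` on a set is a hash lookup, while `in` on a list (like the asker's `self.log`) scans every element. A quick self-contained benchmark sketch (sizes and urls are illustrative):

```python
import timeit

n = 100_000
as_list = [f"http://example.com/page{i}" for i in range(n)]
as_set = set(as_list)
probe = "http://example.com/missing"  # worst case for the list: a full scan

# Time 100 membership tests against each container.
t_list = timeit.timeit(lambda: probe in as_list, number=100)
t_set = timeit.timeit(lambda: probe in as_set, number=100)
print(t_set < t_list)  # True: the set lookup is dramatically faster
```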

    Use Ctrl-C to interrupt your program while it is stuck in the infinite loop. This prints a traceback showing the statement that was executing when the program was interrupted; do it a few times and you should get a good idea of where the loop is. Alternatively, attach a debugger, pause while the program is looping, and use the "step" feature to follow execution line by line. PyCharm is a great editor that includes a debugger, has good autocompletion, and is good all-around. It's free, check it out.
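As a sketch of the Ctrl-C technique: the snippet below simulates the keystroke with `_thread.interrupt_main()` (which raises `KeyboardInterrupt` in the main thread) so the resulting traceback can be inspected programmatically; in practice you would simply press Ctrl-C in the terminal. The `crawl_forever` function is a hypothetical stand-in for a crawler stuck revisiting pages.

```python
import _thread
import threading
import time
import traceback

def crawl_forever():
    # Stand-in for a crawler stuck revisiting the same pages.
    while True:
        pass

# Simulate pressing Ctrl-C shortly after the "crawl" starts.
threading.Thread(
    target=lambda: (time.sleep(0.2), _thread.interrupt_main()),
    daemon=True,
).start()

try:
    crawl_forever()
except KeyboardInterrupt:
    tb = traceback.format_exc()
    print(tb)  # the last frame names the line the loop was stuck on
```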