Tags: python, web-crawler, twisted

Web crawler Using Twisted


I am trying to create a web crawler with Python and Twisted. The problem is that at the time of calling reactor.run() I don't yet know all the links to fetch, so the code goes like:

from twisted.web.client import getPage
from twisted.internet import reactor

def crawl(url):
    d = getPage(url)           # returns a Deferred that fires with the page body
    d.addCallback(handlePage)
    reactor.run()

and handlePage has something like:

def handlePage(output):
    urls = getAllUrls(output)

So now I need to apply crawl() to each URL in urls. How do I do that? Should I stop the reactor and start it again? If I am missing something obvious, please tell me.


Solution

  • You don't want to stop the reactor; you just want to download more pages. So refactor your crawl function so that it neither starts nor stops the reactor:

    from twisted.web.client import getPage
    from twisted.internet import reactor

    def crawl(url):
        d = getPage(url)
        d.addCallback(handlePage)

    def handlePage(output):
        urls = getAllUrls(output)
        for url in urls:
            crawl(url)

    crawl(startUrl)   # kick off the first fetch, then hand control to the reactor
    reactor.run()
    

    You may want to look at scrapy instead of building your own from scratch, though.
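    One practical note: crawl as written will happily re-fetch pages that link back to each other, and nothing ever calls reactor.stop(). A minimal sketch of the missing bookkeeping, a visited set plus a "run until the frontier is exhausted" loop, is below; fetch and get_all_urls are hypothetical synchronous stand-ins for getPage and getAllUrls, so the same logic can be tested without a reactor, but it transfers directly to the Deferred version (where "return pages" becomes the point to call reactor.stop()):

    ```python
    def make_crawler(fetch, get_all_urls):
        """Build a crawl function around pluggable fetch/parse callables."""
        visited = set()

        def crawl(url, pages):
            if url in visited:              # skip pages we've already fetched
                return
            visited.add(url)
            body = fetch(url)               # in Twisted this would be a Deferred
            pages[url] = body
            for link in get_all_urls(body):
                crawl(link, pages)          # recurse into newly discovered links

        def run(start_url):
            pages = {}
            crawl(start_url, pages)
            return pages                    # in Twisted: reactor.stop() here

        return run
    ```

    For example, with a tiny in-memory "web" where each page body is a space-separated list of links, `make_crawler(lambda u: links[u], str.split)("a")` visits each page exactly once even though the pages link to each other.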