Tags: python, screen-scraping, scrapy

Dynamically add to allowed_domains in a Scrapy spider


I have a spider that starts with a small list of allowed_domains. I need to add more domains to this whitelist dynamically as the crawl continues, from within a parse callback, but the following piece of code does not accomplish that: subsequent requests are still being filtered. Is there another way of updating allowed_domains from within the parser?

class APSpider(BaseSpider):
    name = "APSpider"

    allowed_domains = ["www.somedomain.com"]

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            self.allowed_domains.append(u.netloc)

            yield Request(url=_website, callback=self.parse_secondary_site)

    ...

Solution

  • The offsite middleware reads allowed_domains only once, when the spider starts, so appending to the list later has no effect on filtering. You could instead leave allowed_domains empty (an empty whitelist disables offsite filtering) and do the domain check yourself, something like the following:

    class APSpider(BaseSpider):
        name = "APSpider"

        start_urls = [
            "http://www.somedomain.com/list-of-websites",
        ]

        def __init__(self):
            # Use an empty list, not None, so append() works and the
            # offsite middleware does not filter any requests.
            self.allowed_domains = []

        def parse(self, response):
            soup = BeautifulSoup(response.body)

            # Seed the whitelist from the first page only.
            if not self.allowed_domains:
                for link_tag in soup.findAll('td', {'class': 'half-width'}):
                    _website = link_tag.find('a')['href']
                    u = urlparse.urlparse(_website)
                    self.allowed_domains.append(u.netloc)

                    yield Request(url=_website, callback=self.parse_secondary_site)

            # Compare the request's host, not the full URL, to the whitelist.
            if urlparse.urlparse(response.url).netloc in self.allowed_domains:
                yield Request(...)

    ...
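
The manual whitelist check used above can be sketched as a standalone helper, independent of Scrapy. The function and URLs below are illustrative, not part of any library; `urlparse` here is the Python 3 `urllib.parse` version of the `urlparse` call used in the spider:

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Return True if the URL's host is in the whitelist
    (hypothetical helper mirroring the manual check in parse())."""
    return urlparse(url).netloc in allowed_domains

# Seed the whitelist from discovered links, as the parse() callback does.
allowed = []
for href in ["http://www.site-a.com/page", "http://www.site-b.com/"]:
    allowed.append(urlparse(href).netloc)

print(is_allowed("http://www.site-a.com/other", allowed))  # True
print(is_allowed("http://www.elsewhere.com/", allowed))    # False
```

Note that this compares exact hostnames, so a subdomain such as `sub.site-a.com` would not match `www.site-a.com`; Scrapy's own offsite filtering also matches subdomains of each allowed domain.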