I need to add more domains to allowed_domains so that I don't get the "Filtered offsite request to" message.
My app gets the URLs to fetch from a database, so I can't add them manually.
I tried to override the spider's __init__ like this:
    def __init__(self):
        super(CrawlSpider, self).__init__()
        self.start_urls = []
        for destination in Phpbb.objects.filter(disable=False):
Setting start_urls this way worked fine; that was the first issue I had to solve. But allowed_domains has no effect.
Do I need to change some configuration to disable domain checking? I don't really want that, since I only want the domains from the database, but disabling the domain check would help me for now.
The allowed_domains parameter is optional. To get started, you can skip it to disable domain filtering.
In scrapy/contrib/spidermiddleware/offsite.py you can override this function to implement your own domain filtering:
    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        domains = [d.replace('.', r'\.') for d in allowed_domains]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)
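
To see what this policy accepts without running a crawl, here is a minimal standalone sketch of the same regex logic. It assumes the database rows hold full URLs, and since allowed_domains expects bare hostnames, it first reduces them with a hypothetical helper, domains_from_urls. That helper, build_host_regex, and the example URLs are made up for illustration and are not part of Scrapy.

```python
import re
from urllib.parse import urlparse


def domains_from_urls(urls):
    """Hypothetical helper: reduce full URLs to bare hostnames."""
    hosts = {urlparse(u).hostname for u in urls}
    hosts.discard(None)  # drop entries that had no parsable host
    return sorted(hosts)


def build_host_regex(allowed_domains):
    """Standalone copy of the policy used by get_host_regex above."""
    if not allowed_domains:
        return re.compile('')  # empty pattern matches any host
    domains = [d.replace('.', r'\.') for d in allowed_domains]
    return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))


# Made-up URLs standing in for the rows read from the database.
urls = [
    "http://forum.example.com/viewforum.php?f=1",
    "http://example.org/board/",
]
domains = domains_from_urls(urls)
host_regex = build_host_regex(domains)

for host in ("forum.example.com", "sub.example.org", "other.net"):
    print(host, bool(host_regex.search(host)))
```

Each listed domain also allows its subdomains, and an empty or missing list allows every host, which is why leaving out allowed_domains turns the filter off.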