I need to add more domains to allowed_domains so that I don't get the "Filtered offsite request to" message.
My app gets the URLs to fetch from a database, so I can't add them manually.
I tried to override the spider's __init__ like this:
    def __init__(self):
        super(CrawlSpider, self).__init__()
        self.start_urls = []
        for destination in Phpbb.objects.filter(disable=False):
            self.start_urls.append(destination.forum_link)
            self.allowed_domains.append(destination.link)
Setting start_urls worked; that was the first issue I had to solve. But the change to allowed_domains has no effect.
Do I need to change some configuration to disable domain checking? I would rather not, since I only want the domains from the database, but disabling the check would help me for now.
Thanks!
The 'allowed_domains' attribute is optional. To get started, you can leave it out, which disables domain filtering. If you want your own domain filtering policy, you can override this method of the offsite middleware in scrapy/contrib/spidermiddleware/offsite.py:
    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        domains = [d.replace('.', r'\.') for d in allowed_domains]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)
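
One thing worth checking, based on how that regex is built: it is anchored with ^ and $, so it only matches bare hostnames such as forum.example.com, not full URLs. The sketch below is a standalone copy of the same regex logic, not Scrapy's actual middleware. The names build_host_regex, is_allowed and to_domain are hypothetical helpers, and the example.com values are made up. I am also assuming the check is applied to the request's hostname, and that destination.link may hold a full URL, which the question does not confirm.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Standalone copy of the regex logic from get_host_regex."""
    if not allowed_domains:
        return re.compile('')  # allow all by default
    domains = [d.replace('.', r'\.') for d in allowed_domains]
    regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
    return re.compile(regex)


def is_allowed(url, allowed_domains):
    """Hypothetical helper: test a URL's hostname against the regex."""
    host = urlparse(url).hostname or ''
    return bool(build_host_regex(allowed_domains).search(host))


def to_domain(link):
    """Hypothetical helper: reduce a full link to its bare hostname.

    Values without a scheme have no parsed hostname, so they are
    returned unchanged.
    """
    return urlparse(link).hostname or link


if __name__ == '__main__':
    # A bare domain matches the host and its subdomains.
    print(is_allowed('http://forum.example.com/viewtopic.php?t=1',
                     ['forum.example.com']))
    # A full URL in the list never matches a hostname.
    print(is_allowed('http://forum.example.com/',
                     ['http://forum.example.com/index.php']))
    # Converting the stored link first gives a usable entry.
    print(to_domain('http://forum.example.com/index.php'))
```

If the database values do turn out to be full URLs, passing them through something like to_domain before appending them to allowed_domains should give the filter entries it can match.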