I am trying to use Scrapy to crawl a website, but there is no sitemap or page index for the site. How can I crawl all of its pages with Scrapy?
I just need to download all pages of the site without extracting any items. Do I only need to set a Rule in the spider to follow all links? And will Scrapy avoid visiting duplicate URLs if I do it that way?
I just found the answer myself. With the `CrawlSpider` class, we only need to pass `allow=()` to the `SgmlLinkExtractor` (called `LinkExtractor` in current Scrapy releases). As the documentation says:
> allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
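
A minimal sketch of what that looks like, assuming a recent Scrapy version (where `LinkExtractor` replaces the old `SgmlLinkExtractor`); the spider name, domain, and output directory are placeholders:

```python
import hashlib
from pathlib import Path

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FullSiteSpider(CrawlSpider):
    # Placeholder names; replace with the real site.
    name = "fullsite"
    allowed_domains = ["example.com"]  # keeps the crawl on this site only
    start_urls = ["https://example.com/"]

    # allow=() (empty) matches every link on a page;
    # follow=True keeps extracting links from each downloaded page.
    rules = (
        Rule(LinkExtractor(allow=()), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Save each page verbatim; no item extraction.
        out_dir = Path("pages")
        out_dir.mkdir(exist_ok=True)
        filename = hashlib.sha1(response.url.encode()).hexdigest() + ".html"
        (out_dir / filename).write_bytes(response.body)
```

As for duplicates: Scrapy's scheduler filters requests through its duplicate filter (`RFPDupeFilter` by default), so the same URL will not be downloaded twice within a single crawl.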