web-crawler scrapy rules extractor

SgmlLinkExtractor in scrapy


I need some help understanding SgmlLinkExtractor in Scrapy.

For the link example.com/YYYY/MM/DD/title I would write:

Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')

For the link example.com/news/economic/title, should I write:

r'\news\category\w+' or r'\news\w+/\w+'? (The category changes, but the URL always contains news.)

For the link example.com/article/title, should I write:

r'\article\w+'? (The URL always contains article.)


Solution

  • It's not possible to answer "should i" questions if you don't provide complete example strings and what you want to match (and what you don't want to match) with a regular expression.

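    My guess is that your regexes won't work because you use \ instead of /. In a regular expression the backslash starts an escape sequence (for example, \n is a newline), so it does not match the / that separates the parts of a URL path.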

    I recommend you go to regex101 and test whether your URLs match your regular expressions.
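The same check can also be run locally with Python's `re` module. The following is only a minimal sketch: the forward-slash patterns, the pattern names, and the dated sample URL are assumptions made for illustration, not the asker's final rules, and `re.search` is used here as a stand-in for how the `allow` patterns would be applied to each extracted URL.

```python
import re

# Hypothetical patterns written with forward slashes (assumed, not from the question).
PATTERNS = {
    "dated": r"\d{4}/\d{2}/\d{2}/\w+",
    "news": r"news/\w+/\w+",
    "article": r"article/\w+",
}

# Sample URLs; the dated one is made up to fit the YYYY/MM/DD/title shape.
URLS = [
    "example.com/2013/05/17/title",
    "example.com/news/economic/title",
    "example.com/article/title",
]


def matching_patterns(url):
    """Return the names of all patterns that match somewhere in the URL."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


for url in URLS:
    print(url, "->", matching_patterns(url))

# One of the backslash patterns from the question: "\n" means a newline here,
# so the pattern finds nothing in the news URL.
backslash_match = re.search(r"\news\w+/\w+", "example.com/news/economic/title")
print("backslash pattern matched:", backslash_match is not None)
```

If a URL you do not want to follow shows up with a non-empty list, the pattern is too loose and needs a more specific prefix or an anchor.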