SgmlLinkExtractor in Scrapy

I need some enlightenment about SgmlLinkExtractor in Scrapy.

For the link example.com/YYYY/MM/DD/title I would write:

Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')

For the link example.com/news/economic/title, should I write:

r'\news\category\w+' or r'\news\w+/\w+'? (The category changes, but the URL always contains news.)

For the link example.com/article/title, should I write:

r'\article\w+'? (The URL always contains article.)

Solution

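It's not possible to answer "should I" questions unless you provide complete example strings, along with what you want the regular expression to match and what it should not match.

My guess is that your regexes won't work because you use \ instead of / as the path separator. In a regular expression a backslash starts an escape sequence, so the \n in \news stands for a newline, not for a slash followed by "news".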

I recommend going to regex101 and testing whether your URLs match your regular expressions.
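The same kind of check can also be run locally with Python's re module. The following is only a minimal sketch: the three patterns and the sample URLs are assumptions modelled on the question, not confirmed requirements, and it assumes the link extractor looks for the pattern anywhere in the URL (as re.search does) rather than requiring a full match.

```python
import re

# Candidate patterns with forward slashes as path separators.
# These are assumed guesses at what the question wants to match.
patterns = {
    "dated": r"/\d{4}/\d{2}/\d{2}/\w+",
    "news": r"/news/\w+/\w+",
    "article": r"/article/\w+",
}

# Hypothetical sample URLs modelled on the ones in the question.
urls = [
    "http://example.com/2014/05/17/title",
    "http://example.com/news/economic/title",
    "http://example.com/article/title",
]


def matching_patterns(url):
    """Return the names of all patterns found somewhere in the URL."""
    return [name for name, pattern in patterns.items() if re.search(pattern, url)]


for url in urls:
    print(url, "->", matching_patterns(url))

# One of the backslash variants from the question: \n is read as a
# newline escape, so the pattern cannot match an ordinary URL.
broken = r"\news\w+/\w+"
print(re.search(broken, "http://example.com/news/economic/title"))
```

Adding URLs that should not be followed to the urls list is a quick way to see whether a pattern is too permissive before putting it into a Rule.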
