python xpath scrapy

Scrapy follows links I don't want, and I don't know how to exclude them


Let's say I have this structure

<div data-next="link0">
   <a href="link1"/>
   <a href="link2"/>
   <a href="link3"/>
   <a href="link4"/>
</div>

and with my Rule object I want to follow only link0, without following link1, link2, link3, or link4.

How can I do that?

I tried

Rule(LinkExtractor(restrict_xpaths=('//div[@data-next]/@data-next')), callback='parse_item'),

but it doesn't work, because restrict_xpaths needs a reference to an element, not to the link itself. But if I remove /@data-next from the XPath, link1, link2, link3, and link4 get scraped too.

So, is there any way to scrape just link0 using the Rule object in this context?


Solution

  • Rule(LinkExtractor(restrict_xpaths='//div[@data-next]', tags='div', attrs='data-next'), callback='parse_item'),
    

    By default, LinkExtractor only looks at the href attribute of <a> and <area> tags. Here the link lives in the data-next attribute of a <div>, so you have to tell it which tags and attributes to search. From the Scrapy docs:

    Parameters:

    (...)

    • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').

    • attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
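
To make the effect of tags and attrs concrete, here is a minimal standard-library sketch of the same filtering idea applied to the markup from the question. It is not Scrapy's implementation: the class TagAttrLinkCollector and the helper collect are hypothetical names made up for this illustration, and the sketch only mimics the "which tag, which attribute" selection, not restrict_xpaths or URL resolution.

```python
from html.parser import HTMLParser


class TagAttrLinkCollector(HTMLParser):
    """Hypothetical stand-in (not part of Scrapy): collect the values of
    selected attributes, but only on selected tags."""

    def __init__(self, tags=("a", "area"), attrs=("href",)):
        super().__init__()
        self.tags = set(tags)
        self.attrs = set(attrs)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Skip every tag that is not in the allowed set.
        if tag not in self.tags:
            return
        # Keep only the attributes we were asked to treat as links.
        for name, value in attrs:
            if name in self.attrs and value:
                self.links.append(value)


def collect(html, **kwargs):
    """Hypothetical helper: run the collector over an HTML string."""
    parser = TagAttrLinkCollector(**kwargs)
    parser.feed(html)
    parser.close()
    return parser.links


HTML = """
<div data-next="link0">
   <a href="link1"/>
   <a href="link2"/>
   <a href="link3"/>
   <a href="link4"/>
</div>
"""

# Defaults mirror the documented ('a', 'area') / ('href',) pair.
default_links = collect(HTML)

# Restricting to <div> and data-next leaves only the pagination link.
next_links = collect(HTML, tags=("div",), attrs=("data-next",))
```

With the defaults the sketch picks up link1 through link4 and never sees link0; with tags=("div",) and attrs=("data-next",) it picks up only link0, which is the behaviour the Rule above relies on.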