python web-scraping scrapy

How to restrict XPath using Scrapy


I want to restrict link extraction to certain XPaths using a LinkExtractor, but I get an error saying the call got multiple values for an argument. What mistake am I making?

import scrapy
from scrapy.http import Request
from selenium import webdriver
from scrapy.http import HtmlResponse
import time
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BarSpider(scrapy.Spider):
    name = 'bar'
    start_urls=["https://www.veteranownedbusiness.com/?mode=geo#BrowseByState"]

   
    def parse(self, response):
        books = response.xpath('//table[@class="categories"]//tr//td//a[@class="category"]//@href').extract()
        for book in books:
            url = response.urljoin(book)
            rules = (Rule(LinkExtractor(restrict_xpaths=('//table[@class="categories"]//tr//td[1]//a[@class="category"]//@href'))))
            yield Request(url ,rules,callback='base_url')

    def base_url(self,response):
        links = response.xpath('//table[@class="listings"]//a//@href').extract()
        for link in links:
            b_link = response.urljoin(link)
            yield{
                'url':b_link,
            }

Solution

  • There are a few issues with your spider.

    1. A Rule object has no effect unless it is listed in the rules attribute of a CrawlSpider. If you simply want to use a LinkExtractor, you can do so without wrapping it in a Rule object.

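    2. LinkExtractors extract links from the elements that the selectors match, so you should not include the @href at the end of your restrict_xpaths expressions. Point them at the a elements (or at a region containing them) instead.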

    3. This is the cause of the error you are receiving: a Request object takes the url as its first positional argument. If it receives a second positional argument, it treats that value as the callback. In your example the url is the first argument, rules is the second, and callback is also passed as a keyword argument, so Python raises an error about receiving multiple values for the callback argument. In addition, Request objects don't accept Rule objects as a parameter.
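
The third point can be reproduced without Scrapy. The sketch below uses a hypothetical stand-in function, make_request, that copies only the first two parameters of Request (url, then callback); the function and its helper try_call are assumptions made for illustration, not part of Scrapy.

```python
def make_request(url, callback=None, **kwargs):
    """Hypothetical stand-in that mirrors only the leading parameters of Request."""
    return {"url": url, "callback": callback, **kwargs}


def try_call(*args, **kwargs):
    """Return the TypeError message if the call fails, otherwise None."""
    try:
        make_request(*args, **kwargs)
    except TypeError as exc:
        return str(exc)
    return None


# The second positional argument and callback= both target the same parameter.
error = try_call("https://example.com", "rules", callback="base_url")
print(error)

# Passing the callback once, by keyword, is accepted.
ok = try_call("https://example.com", callback="base_url")
print(ok)
```

The first call fails for the same reason as yield Request(url, rules, callback='base_url') in the question: the second positional value already fills the callback slot.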

    To address these issues, instantiate a LinkExtractor directly, remove the @href part of your XPath, then iterate over the extracted links and yield a Request for each one.

    For example:

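        def parse(self, response):
            # Restrict extraction to the anchors themselves, not their @href.
            extractor = LinkExtractor(restrict_xpaths=[
                '//table[@class="categories"]//tr//td[1]//a[@class="category"]'
            ])
            for link in extractor.extract_links(response):
                # link.url is already absolute, so urljoin returns it unchanged.
                url = response.urljoin(link.url)
                yield Request(url, callback=self.base_url)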