What's the difference between response.xpath() and Selector(text=response.text).xpath()?


>>> print(response.text)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
  <loc>https://cargadgetss.com/sitemap-product.xml</loc>
 </sitemap>
 <sitemap>
  <loc>https://cargadgetss.com/sitemap-category.xml</loc>
 </sitemap>
 <sitemap>
  <loc>https://cargadgetss.com/sitemap-page.xml</loc>
 </sitemap>
</sitemapindex>

>>> response.xpath('//loc')
[]
>>> Selector(text=response.text).xpath('//loc')[0].extract()
'<loc>https://cargadgetss.com/sitemap-product.xml</loc>'
>>>

I want to extract the tag contents from this XML text. I have only just started learning how to extract data with Scrapy, and I have always used response.xpath in my code, but this time it doesn't work. So I tried Selector and, luckily, got the data I need. But I still don't understand: why can the data be extracted with Selector, but not with response.xpath alone?


Solution

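  • That's because of the XML namespace (xmlns). The response is parsed as XML, so every element belongs to the namespace declared on <sitemapindex>, and the unprefixed //loc matches nothing. Selector(text=...) defaults to the HTML parser, which ignores namespaces, so //loc matches there. Another way to extract those URLs is to assign a prefix to the namespace and use it in the XPath expression.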

    For example:

    >>> response.xpath("//x:loc/text()", namespaces={"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}).getall()                  
    ['https://cargadgetss.com/sitemap-product.xml',
     'https://cargadgetss.com/sitemap-category.xml',
     'https://cargadgetss.com/sitemap-page.xml']
    

    (The parsel documentation has more on working with namespaces.)

    However, if you want to extract links from a sitemap, I'd advise using Scrapy's SitemapSpider, which also follows sitemap index files like this one for you. For example:

    from scrapy.spiders import SitemapSpider
    
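    class MySpider(SitemapSpider):
        # Sitemap (or sitemap index) URLs to start crawling from
        sitemap_urls = ['http://www.example.com/sitemap.xml']
        # (regex, callback name): URLs matching the regex are sent to that callback
        sitemap_rules = [
            ('/product/', 'parse_product'),
            ('/category/', 'parse_category'),
        ]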
    
        def parse_product(self, response):
            pass # ... scrape product ...
    
        def parse_category(self, response):
            pass # ... scrape category ...