python, xpath, scrapy, response

Why does the result of response.xpath('//html') differ from response.body?


I'm trying to parse this page using Scrapy: http://mobileshop.ae/one-x

I need to extract the product links. The problem is that the links are present in response.body, but they are missing from the output of response.xpath('//body').extract().

The results of response.body and response.xpath('//body') are different. The same is true for '//html':

>>> body = response.body
>>> body_2 = response.xpath('//html').extract()[0]
>>> len(body)
238731
>>> len(body_2)
67520

I get the same short result for response.xpath('.').extract()[0].

Any idea why this happens, and how can I extract the data I need?


Solution

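  • The issue here is that the page contains a lot of malformed markup, including several unclosed tags. The selector's default lxml parser tries to recover from these errors and can end up discarding part of the content, which would explain why the extracted tree is so much shorter than response.body. One way to solve this problem is to parse the malformed content with lxml's soupparser (which uses BeautifulSoup under the covers) and build a Scrapy Selector from the resulting tree.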

    Example session with scrapy shell http://mobileshop.ae/one-x:

    >>> from lxml.html import soupparser
    >>> from scrapy import Selector
    >>> sel = Selector(_root=soupparser.fromstring(response.body))
    >>> sel.xpath('//h4[@class="name"]/a').extract()
    [u'<a href="http://mobileshop.ae/one-x/htc-one-x-16gb-gray">HTC One X 3G 16GB Grey</a>',
     u'<a href="http://mobileshop.ae/one-x/htc-one-x-16gb-white">HTC One X 3G 16GB White</a>',
     u'<a href="http://mobileshop.ae/one-x/htc-one-x-32gb-gray">HTC One X 3G 32GB Grey</a>',
     u'<a href="http://mobileshop.ae/one-x/htc-one-x-32gb-white">HTC One X 3G 32GB White</a>']
    

    Note that using the BeautifulSoup parser is a lot slower than lxml's default parser. You probably want to do this only in the places where it's really needed.
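
If BeautifulSoup turns out to be too slow and only a handful of links are needed, a lighter option may be to skip tree building and scan the raw body with an event-based parser. The sketch below is not the soupparser approach above but a swapped-in alternative using Python 3's standard html.parser, which emits tag events without requiring well-formed markup. The names ProductLinkParser and extract_product_links are hypothetical, the page is assumed to be UTF-8, and the sample markup is made up for illustration (it is not the real page).

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect hrefs of <a> tags found inside <h4 class="name"> elements."""

    def __init__(self):
        super().__init__()
        self.in_name_heading = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "h4":
            # Track whether we are inside a product-name heading.
            classes = (attr_map.get("class") or "").split()
            self.in_name_heading = "name" in classes
        elif tag == "a" and self.in_name_heading and attr_map.get("href"):
            self.links.append(attr_map["href"])

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_name_heading = False


def extract_product_links(body):
    """Return product links from raw HTML given as bytes or str."""
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not fatal.
    text = body.decode("utf-8", errors="replace") if isinstance(body, bytes) else body
    parser = ProductLinkParser()
    parser.feed(text)
    parser.close()
    return parser.links


# Made-up malformed snippet (unclosed <p> and <span>), not the real page.
sample = (
    b'<div><p>unclosed paragraph'
    b'<h4 class="name"><a href="http://mobileshop.ae/one-x/htc-one-x-16gb-gray">'
    b'HTC One X 3G 16GB Grey</a></h4>'
    b'<span><h4 class="name"><a href="http://mobileshop.ae/one-x/htc-one-x-16gb-white">'
    b'HTC One X 3G 16GB White</a></h4>'
)
links = extract_product_links(sample)
print(links)
```

In a spider callback this would be called as extract_product_links(response.body). The trade-off is that there is no XPath support: the matching logic lives in the handler methods, so this only suits simple patterns like the h4/a pairs shown in the session above.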