I'm trying to parse this page using Scrapy: http://mobileshop.ae/one-x
I need to extract the links of the products. The problem is that the links are present in response.body, but not available if you try response.xpath('//body').extract() — the results of response.body and response.xpath('//body') are different.
>>> body = response.body
>>> body_2 = response.xpath('//html').extract()[0]
>>> len(body)
238731
>>> len(body_2)
67520
I get the same short result for response.xpath('.').extract()[0].
Any idea why this happens, and how I can extract the data I need?
So, the issue here is that the page contains a lot of malformed content, including several unclosed tags. One way to solve this is to use lxml's soupparser to parse the malformed content (it uses BeautifulSoup under the covers) and build a Scrapy Selector from the result.
Example session with scrapy shell http://mobileshop.ae/one-x:
>>> from lxml.html import soupparser
>>> from scrapy import Selector
>>> sel = Selector(_root=soupparser.fromstring(response.body))
>>> sel.xpath('//h4[@class="name"]/a').extract()
[u'<a href="http://mobileshop.ae/one-x/htc-one-x-16gb-gray">HTC One X 3G 16GB Grey</a>',
u'<a href="http://mobileshop.ae/one-x/htc-one-x-16gb-white">HTC One X 3G 16GB White</a>',
u'<a href="http://mobileshop.ae/one-x/htc-one-x-32gb-gray">HTC One X 3G 32GB Grey</a>',
u'<a href="http://mobileshop.ae/one-x/htc-one-x-32gb-white">HTC One X 3G 32GB White</a>']
Note that using the BeautifulSoup parser is a lot slower than lxml's default parser, so you probably want to do this only in the places where it's really needed.
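If you want to avoid the BeautifulSoup dependency entirely, the standard library's html.parser is also tolerant of unclosed tags. Here is a minimal, self-contained sketch of the same extraction (the sample markup and the h4/class="name" structure mirror the page above, but the snippet itself is illustrative, not the page's actual HTML):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags nested inside <h4 class="name">."""

    def __init__(self):
        super().__init__()
        self.in_name_h4 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h4" and attrs.get("class") == "name":
            self.in_name_h4 = True
        elif tag == "a" and self.in_name_h4 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_name_h4 = False

# Deliberately malformed sample: unclosed <div> and a missing </a>.
broken = """
<div><h4 class="name"><a href="/one-x/htc-one-x-16gb-gray">HTC One X 16GB Grey</a></h4>
<h4 class="name"><a href="/one-x/htc-one-x-16gb-white">HTC One X 16GB White</h4>
"""

parser = LinkExtractor()
parser.feed(broken)
print(parser.links)
# → ['/one-x/htc-one-x-16gb-gray', '/one-x/htc-one-x-16gb-white']
```

Because html.parser is event-based rather than tree-based, it never has to "repair" the document, so unclosed tags don't truncate the output the way they can with lxml's tree parser. It's slower to write (you track state yourself), but it has no external dependencies.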