Tags: python, xpath, web-scraping, scrapy, parsel

Scrapy xpath removing text after < character


I am trying to scrape product information from this page. To get the description (at the bottom of the page), I use this XPath:

response.xpath('//*[@itemprop="description"]/table//text()').extract()[3].strip()

This gives me the description:

u'Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section ('

whereas the one present on the site is

Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (<2cm), Belt Length: 93cm
Product Type: Belts, Accessories

I have verified that the content on the site loads even after disabling javascript. What am I missing here?


Solution

  • Ideally this would be handled without a workaround, but you can get it working with:

    from parsel import Selector
    ...
    
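    # type='xml' makes parsel use lxml's XML parser instead of the HTML one.
    # On recent Scrapy versions, response.text is the preferred spelling of body_as_unicode().
    s = Selector(text=response.body_as_unicode(), type='xml')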
    s.xpath('//*[@itemprop="description"]/table//text()').extract()[3].strip()
    # gives u'Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (2cm), Belt Length: 93cm'
    

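    The problem is that parsel (the parser Scrapy uses internally) parses HTML with lxml.etree.HTMLParser(recover=True, encoding='utf8'). In recovery mode, that parser discards characters it cannot make sense of, such as a bare < that does not start a valid tag, so the text after it is lost.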
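
A different workaround, not part of the answer above, is to escape bare < characters before the markup reaches the parser, so the HTML parser sees an ordinary &lt; entity instead of a broken tag. The sketch below is only a heuristic under stated assumptions: escape_stray_lt is a hypothetical helper name, the sample string is made up to resemble the description cell, and the regex would also rewrite comparisons inside inline scripts, so it is best applied to the fragment you care about rather than a whole page.

```python
import re

# A "<" that is not followed by a letter, "/", "!" or "?" cannot start a tag,
# closing tag, comment/doctype, or processing instruction, so it is treated
# as literal text here. This is a heuristic, not a full HTML tokenizer.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace bare '<' characters with the '&lt;' entity (hypothetical helper)."""
    return _STRAY_LT.sub("&lt;", html)


# Made-up sample resembling the product description cell.
sample = "<td>With the body width: section (<2cm), Belt Length: 93cm</td>"
cleaned = escape_stray_lt(sample)
print(cleaned)
```

The cleaned string could then be passed to Selector(text=cleaned) with the default HTML type. This has not been checked against the page in the question, so treat it as a starting point.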