python · web-scraping · scrapy

How can I debug Scrapy?


I'm 99% sure something is going on with my hxs.select on this website; I cannot extract anything. When I run the following code, I get no error feedback, and neither title nor link gets populated. Any help?

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="footer"]')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items

Is there a way I can debug this? I also tried the scrapy shell command with a URL, but when I enter view(response) in the shell it simply returns True, and a text file opens instead of my web browser.

>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'

>>> hxs.select('//div')
Traceback (most recent call last):
    File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'select'

>>> view(response)
True

>>> hxs.select('//body')
Traceback (most recent call last):
    File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'select'
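As a side note, an XPath like the one in the spider can be sanity-checked outside Scrapy with plain lxml. This is a minimal sketch on a made-up stand-in page (the real response body isn't available here), assuming lxml is installed:

```python
import lxml.html

# A made-up stand-in page; in practice the markup would come from the site.
html = """
<html><body>
  <div class="footer">
    <a href="/help">Help</a>
    <a href="/about">About</a>
  </div>
</body></html>
"""

root = lxml.html.fromstring(html)
# The same XPath expressions the spider uses, run outside of Scrapy:
titles = root.xpath("//div[@class='footer']//a/text()")
links = root.xpath("//div[@class='footer']//a/@href")
print(titles)  # ['Help', 'About']
print(links)   # ['/help', '/about']
```

If this works on a saved copy of the page but fails inside the spider, the problem is the response itself (e.g. it isn't the HTML you expect), not the XPath.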

Solution

  • Scrapy shell is a good tool for that, indeed. And if your document has an XML stylesheet, it is probably an XML document, so you can use scrapy shell with xxs instead of hxs, as in this Scrapy documentation example about removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces

    When that doesn't work, I tend to go back to pure lxml.etree and dump the whole document's elements:

    import lxml.etree
    import lxml.html
    
    from scrapy.spider import BaseSpider
    
    class myspider(BaseSpider):
        ...
        def parse(self, response):
            self.log("\n\n\n We got data! \n\n\n")
            # fromstring() already returns the root element, so no .getroot() is needed
            root = lxml.etree.fromstring(response.body)
            # or for broken XML docs:
            # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
            # or for HTML:
            # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
    
            # then look up which elements you can actually select
            print list(root.iter())  # this can be very big, but at least you see everything that's inside: the element tags and namespaces
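To illustrate why the namespace matters, here is a rough sketch with lxml directly, using a made-up XML document with a default namespace standing in for response.body. The dump from root.iter() shows the namespace URI baked into every tag, which is exactly why a plain //div finds nothing:

```python
import lxml.etree

# Made-up XML with a default namespace, standing in for response.body.
xml = b"""<page xmlns="http://example.com/ns">
  <div class="footer"><a href="/x">X</a></div>
</page>"""

root = lxml.etree.fromstring(xml)

# With a default namespace in effect, a plain XPath matches nothing:
print(root.xpath('//div'))  # []

# The iter() dump reveals the namespace URI carried by every element tag:
print([el.tag for el in root.iter()])
# ['{http://example.com/ns}page', '{http://example.com/ns}div', '{http://example.com/ns}a']

# local-name() sidesteps the namespace, much like xxs.remove_namespaces() does:
print(root.xpath('//*[local-name()="div"]'))
```

Once you see namespaced tags in the dump, you know to either remove the namespaces or qualify your XPath accordingly.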