I am new to Scrapy and was playing with the Scrapy shell, trying to crawl this site: www.spiegel.de/sitemap.xml
I did it with
scrapy shell "http://www.spiegel.de/sitemap.xml"
and it works fine. When I use
response.body
I can see the whole page, including the XML tags.
However, this, for instance:
response.xpath('//loc')
simply won't work.
The result I get is an empty list,
while
response.selector.re('somevalidregexpexpression')
does work.
Any idea what the reason could be? Could it be related to encoding? The site is not UTF-8.
I am using Python 2.7 on Windows 7. I tried xpath() on another site (dmoz) and it worked fine.
The problem is caused by the default namespace declared on the root element of the XML:
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
So in that XML, the root element and all of its unprefixed descendants implicitly belong to the same namespace.
XPath, on the other hand, has no notion of a default namespace: to reference an element in a namespace, you need a prefix that is bound to the namespace URI. An unprefixed name such as loc only matches elements in no namespace, which is why //loc returns nothing here.
You can use selector.register_namespace()
to bind a prefix to the default namespace URI, and then use that prefix in your XPath:
response.selector.register_namespace('d', 'http://www.sitemaps.org/schemas/sitemap/0.9')
response.xpath('//d:loc')
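
The same namespace rule can be reproduced outside Scrapy. Below is a minimal sketch using Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors (which are built on lxml), so it only illustrates the rule and is not what Scrapy runs internally. The inline sitemap and its example.com URLs are made-up placeholders, and the prefix d is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sitemap; the URLs are placeholders, not real data.
SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://example.com/a</loc></url>'
    '<url><loc>http://example.com/b</loc></url>'
    '</urlset>'
)

# Prefix-to-URI mapping; any prefix works as long as the URI matches.
NS = {'d': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

root = ET.fromstring(SITEMAP)

# Unprefixed name: looks for <loc> in no namespace, so nothing matches.
unprefixed = root.findall('.//loc')

# Prefixed name: 'd' is resolved to the sitemap namespace URI.
prefixed = [el.text for el in root.findall('.//d:loc', NS)]

print(unprefixed)  # []
print(prefixed)    # ['http://example.com/a', 'http://example.com/b']
```

The prefix only has to be bound to the right URI; it does not need to appear anywhere in the XML document itself.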