I would like to extract some book links from this table using scrapy.
The table looks like this:
<table id="table_text">
<tbody>
<tr >
<td>15/02/2014</td>
<td><a href="/book_1.html">Book 1</a></td>
<td>The Author</td>
<td> <a href="/tag1">tag1</a> <a href="/tag2">tag2</a> </td>
<td>Genre</td>
</tr>
and the extracted link should be:
/book_1.html
The selector that I used is:
from scrapy.selector import Selector

def parse(self, response):
    hxs = Selector(response)
    # href of every link in the second cell of each row
    links = hxs.xpath('//table[@id="table_text"]//tr//td[2]//a//@href')
but print(links) shows an empty list: []
What is wrong with the XPath that I used?
With the information you gave, your XPath is working fine. It could be simplified to
//table[@id="table_text"]//tr/td[2]/a/@href
but your version returns the right node.
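As a quick sanity check, the sketch below runs the simplified path against the row from the question. It is only an approximation: it uses the standard library's ElementTree instead of Scrapy's Selector so it runs without Scrapy installed, the closing tags and the html/body wrapper are added so the snippet parses as XML, and because ElementTree supports only a subset of XPath (no /@href step) the href is read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# The row from the question. Assumption: the closing tags and the
# <html><body> wrapper are added here so the snippet is well-formed XML.
html = """
<html><body>
<table id="table_text">
<tbody>
<tr>
<td>15/02/2014</td>
<td><a href="/book_1.html">Book 1</a></td>
<td>The Author</td>
<td> <a href="/tag1">tag1</a> <a href="/tag2">tag2</a> </td>
<td>Genre</td>
</tr>
</tbody>
</table>
</body></html>
""".strip()

root = ET.fromstring(html)

# ElementTree has no /@href step, so match the <a> elements in the
# second <td> of each row and read the attribute from them.
anchors = root.findall(".//table[@id='table_text']//tr/td[2]/a")
links = [a.get("href") for a in anchors]

print(links)  # ['/book_1.html']
```

Only the book link is returned; the tag links sit in the fourth cell and are not matched.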
When Scrapy behaves unexpectedly, always check that the HTML it receives is the HTML you expect. A page fetched by Scrapy can differ from what you see in a browser, because Scrapy doesn't execute JavaScript (and some browsers try to sanitize the HTML they display).
That's why you should check that the content of response.body is what you expect. If it's not, you'll need to find a workaround :)
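One way to run that check is to save what was actually downloaded and look for the element the XPath depends on. The helper below is a hypothetical sketch: the name dump_and_check, the default marker, and the dump file name are made up for illustration. It only needs raw bytes, so inside a spider callback you could call it as dump_and_check(response.body).

```python
import os
import tempfile


def dump_and_check(body, marker=b'id="table_text"', path=None):
    """Write the raw body to a file for inspection and report whether
    the marker the XPath relies on appears in it.

    Hypothetical helper; the byte match is naive (exact quoting and
    spacing), so treat a miss as a hint to open the dump, not as proof.
    """
    if path is None:
        path = os.path.join(tempfile.gettempdir(), "scrapy_body_dump.html")
    with open(path, "wb") as f:
        f.write(body)
    return marker in body, path


# Stand-in for response.body: a page whose table would be built by
# JavaScript, so the downloaded HTML contains no table at all.
js_only_page = (
    b"<html><body><div id='app'></div>"
    b"<script>render()</script></body></html>"
)

found, saved_to = dump_and_check(js_only_page)
print(found, saved_to)
```

If the marker is missing, open the dumped file and compare it with the page source shown by the browser to see what differs.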