So I'm relatively new to XPath and I'm having a little difficulty homing in on the exact syntax I need for my specific application. The scraper I have built works fine when I use a less complicated path, but once I try to get more specific with the path, it doesn't return the proper values.
A simplified model of the document structure that I am trying to scrape is:
<table class="rightLinks">
  <tbody>
    <tr>
      <td>
        <a href="http://wwww.example.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example2.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example3.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example4.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
  </tbody>
</table>
Basically, I would like to grab the href value and the link text of each of these links.
This is the portion of my scraper regarding this and what I have tried so far:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tbody/tr/*[1]/a'):
        item = DanishItem()
        item['company_name'] = sel.xpath('/text()').extract()
        item['website'] = sel.xpath('/@href').extract()
        yield item
Edit: new paths I'm using
def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tr/*[1]/a'):
        item = DanishItem()
        item['company_name'] = sel.text
        item['website'] = sel.attrib['href']
        yield item
Final Edit: Working code (thanks guys!)
def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tr/*[1]/a'):
        item = DanishItem()
        item['company_name'] = sel.xpath('./text()').extract()
        item['website'] = sel.xpath('./@href').extract()
        yield item
Any suggestions or hints would be much appreciated!
Joey
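sel.xpath('/text()') and sel.xpath('/@href') are both absolute paths: the leading / starts from the document root rather than from the <a> element that sel points at, so they match nothing. If you want relative paths, use ./text() or ./@href.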
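To make the difference concrete, here is a minimal sketch that uses lxml directly rather than Scrapy (that swap is my assumption; Scrapy's selectors are built on top of lxml, so the path semantics should match). The PAGE string is a trimmed, two-row copy of the table from the question, and the variable names are only illustrative.

```python
from lxml import html

# Trimmed copy of the table from the question (two rows instead of four).
PAGE = """
<html><body>
<table class="rightLinks">
  <tbody>
    <tr>
      <td><a href="http://wwww.example.com">Text That I want to Grab</a></td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td><a href="http://wwww.example2.com">Text That I want to Grab</a></td>
      <td>Some</td>
      <td>Text</td>
    </tr>
  </tbody>
</table>
</body></html>
"""

doc = html.document_fromstring(PAGE)
links = doc.xpath('//table[@class="rightLinks"]/tbody/tr/*[1]/a')

# Absolute paths: evaluated from the document root, not from each <a>.
absolute_text = [a.xpath('/text()') for a in links]
absolute_href = [a.xpath('/@href') for a in links]

# Relative paths: evaluated from each <a> element.
relative_text = [a.xpath('./text()') for a in links]
relative_href = [a.xpath('./@href') for a in links]

print(absolute_text, absolute_href)
print(relative_text)
print(relative_href)
```

The absolute versions come back empty for every link, while the relative versions return the link text and the href of each <a>.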
If this is lxml -- and sel is an lxml Element object -- just use sel.text, or sel.attrib['href'] -- no XPath needed.
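A rough sketch of that second option, assuming you are working with plain lxml Elements (Scrapy's response.xpath() hands back Selector objects instead, so this only applies when you parse with lxml yourself). The ROW string is a single row lifted from the question's table, and the variable names are just for illustration.

```python
from lxml import etree

# One row from the question's table, parsed on its own for brevity.
ROW = (
    '<tr>'
    '<td><a href="http://wwww.example.com">Text That I want to Grab</a></td>'
    '<td>Some</td>'
    '<td>Text</td>'
    '</tr>'
)

row = etree.fromstring(ROW)

# find() returns a plain lxml Element for the first matching <a>.
link = row.find('td/a')

# Text content and attributes are available directly on the Element.
company_name = link.text
website = link.attrib['href']

print(company_name, website)
```

link.get('href') is an alternative to link.attrib['href'] that returns None instead of raising KeyError when the attribute is missing.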