Extracting text nodes or elements with relative XPath in Scrapy


I'm relatively new to XPath and am having trouble homing in on the exact syntax I need for my application. The scraper I have built works fine when I use a less specific path, but once I make the path more specific it no longer returns the expected values.

A simplified model of the document structure that I am trying to scrape is:

<table class="rightLinks">
  <tbody>
    <tr>
      <td>
        <a href="http://wwww.example.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example2.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example3.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example4.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
  </tbody>
</table>

Basically, I would like to grab the href value and the link text of each link.

This is the relevant portion of my scraper and what I have tried so far:

  import scrapy
  from scrapy.selector import HtmlXPathSelector
  from scrapy.http import HtmlResponse

  def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tbody/tr/*[1]/a'):
      item = DanishItem()
      item['company_name'] = sel.xpath('/text()').extract()
      item['website'] = sel.xpath('/@href').extract()
      yield item

Edit: new paths I'm using

  def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tr/*[1]/a'):
      item = DanishItem()
      item['company_name'] = sel.text
      item['website'] = sel.attrib['href']
      yield item

Final Edit: Working code (thanks guys!)

  def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tr/*[1]/a'):
      item = DanishItem()
      item['company_name'] = sel.xpath('./text()').extract()
      item['website'] = sel.xpath('./@href').extract()
      yield item

Any suggestions or hints would be much appreciated!

Joey


Solution

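  • sel.xpath('/text()') and sel.xpath('/@href') are both absolute paths: the leading / makes them start from the document root instead of from the <a> element that sel points to, so they match nothing. For relative paths, use ./text() and ./@href.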

    If this is lxml -- and sel is an lxml Element object -- just use sel.text, or sel.attrib['href'] -- no XPath needed.
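
    The element-attribute approach can be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python's standard-library xml.etree.ElementTree as a stand-in for lxml (both expose .text and .attrib on elements), it assumes the markup is well-formed enough to parse as XML, and the helper name extract_links and the trimmed two-row sample are made up for the example. It is not Scrapy's selector API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question (two rows are enough here).
HTML = """
<table class="rightLinks">
  <tbody>
    <tr>
      <td><a href="http://wwww.example.com">Text That I want to Grab</a></td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td><a href="http://wwww.example2.com">Text That I want to Grab</a></td>
      <td>Some</td>
      <td>Text</td>
    </tr>
  </tbody>
</table>
"""


def extract_links(markup):
    """Return (text, href) pairs for the link in the first cell of each row.

    Hypothetical helper for illustration; not part of the original scraper.
    """
    root = ET.fromstring(markup)  # root is the <table> element
    pairs = []
    # Paths starting with "./" are evaluated relative to the current element.
    for row in root.findall("./tbody/tr"):
        link = row.find("./td[1]/a")  # the <a> inside the first <td> of this row
        if link is not None:
            # Plain element attributes, no further XPath needed.
            pairs.append((link.text, link.attrib["href"]))
    return pairs


links = extract_links(HTML)
print(links)
```

    In the Scrapy version, the equivalent step inside the loop would be the relative ./text() and ./@href queries shown in the working code from the question.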