XPath parent and descendants in Scrapy


I am using the code

response.xpath("//*[contains(text(), 'Role')]/parent/parent/descendant::td//text()").extract()

to select the text() content of every td in the rows that follow the row where the word 'Role' is found, in the following HTML table:

<table class="wh_preview_detail" border="1">
   <tr>
      <th colspan="3">
         <span class="wh_preview_detail_heading">Names</span>
      </th>
   </tr>
   <tr>
      <th>Role</th>
      <th>Name No</th>
      <th>Name</th>
   </tr>
   <tr>
      <td>Requestor</td>
      <td>589528</td>
      <td>John</td>
   </tr>
   <tr>
      <td>Helper</td>
      <td>589528</td>
      <td>Mary</td>
   </tr>
</table>

The 'Role' keyword is only acting as an identifier for the table.

In this case I'm expecting the result:

['Requestor', '589528', 'John', ...]

However, I get an empty list when running this in Scrapy.

My aim is ultimately to group the elements back into records. I have spent a few hours trying others' examples and experimenting in the terminal and in Chrome, but anything more than 'simple' XPath is beyond me right now. I am looking to understand XPath, so ideally I would like a generalised answer with an explanation; that way I can learn and also share. Thank you kindly.


Solution

  • As general advice, it's usually easier to build your XPath expression by going down the tree, step by step, rather than selecting //typeiwant directly and then adding predicates for what came before it in the tree (with parent or ancestor).

    Let's look at how to solve your use case with Scrapy selectors:

    >>> import scrapy
    >>> t = '''<table class="wh_preview_detail" border="1">
    ...    <tr>
    ...       <th colspan="3">
    ...          <span class="wh_preview_detail_heading">Names</span>
    ...       </th>
    ...    </tr>
    ...    <tr>
    ...       <th>Role</th>
    ...       <th>Name No</th>
    ...       <th>Name</th>
    ...    </tr>
    ...    <tr>
    ...       <td>Requestor</td>
    ...       <td>589528</td>
    ...       <td>John</td>
    ...    </tr>
    ...    <tr>
    ...       <td>Helper</td>
    ...       <td>589528</td>
    ...       <td>Mary</td>
    ...    </tr>
    ... </table>'''
    >>> selector = scrapy.Selector(text=t, type="html")
    >>>
    >>> # what you want comes inside a <table>,
    >>> # after a <tr> that has a child `<th>` with text "Role" inside
    >>> selector.xpath('//table/tr[th[1]="Role"]')
    [<Selector xpath='//table/tr[th[1]="Role"]' data=u'<tr>\n      <th>Role</th>\n      <th>Name '>]
    >>>
    >>> # check with .extract() that it's the right one...
    >>> selector.xpath('//table/tr[th[1]="Role"]').extract()
    [u'<tr>\n      <th>Role</th>\n      <th>Name No</th>\n      <th>Name</th>\n   </tr>']
    >>> 
    

    Then, the rows you're interested in are at the same tree level as that <tr> with "Role". In XPath terms, these <tr> elements are on the following-sibling axis:

    >>> for row in selector.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
    ...     print('------')
    ...     print(row.extract())
    ... 
    ------
    <tr>
          <td>Requestor</td>
          <td>589528</td>
          <td>John</td>
       </tr>
    ------
    <tr>
          <td>Helper</td>
          <td>589528</td>
          <td>Mary</td>
       </tr>
    >>> 
    

    So you now have each row, and each row has 3 cells to map to 3 fields:

    >>> for row in selector.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
    ...     print({
    ...         "role": row.xpath('normalize-space(./td[1])').extract_first(),
    ...         "number": row.xpath('normalize-space(./td[2])').extract_first(),
    ...         "name": row.xpath('normalize-space(./td[3])').extract_first(),
    ...     })
    ... 
    {'role': u'Requestor', 'number': u'589528', 'name': u'John'}
    {'role': u'Helper', 'number': u'589528', 'name': u'Mary'}
    >>>