
Using XPath, how do I extract data from table cells that sometimes contain links?


I have this HTML table:

<table class="info">
<tbody>
    <tr><td class="name">Year</td><td>2011</td></tr>
    <tr><td class="name">Storey</td><td>3</td></tr>
    <tr><td class="name">Title</td><td><a href="http://gov.kz/premera/">Premier</a></td></tr>
    <tr><td class="name">Condition</td><td>Renovated</td></tr>
</tbody>
</table>

In this table, each row contains two cells enclosed in <td> tags. The first cell names the kind of data (for example, the year the house was built), and the second cell holds the value itself (here, 2011).

I am trying to extract the values from the second cell of each row (here: 2011, 3, Premier, Renovated).

I use this XPath expression:

//table[@class="info"]//td[2]/text()

Received output (wrong):

2011
3
Renovated

Desired output:

2011
3
Premier
Renovated

As you can see, the second <td> in the third row contains a link instead of plain text, so the value from that row is missing and the desired string "Premier" is not returned. Some cells contain links and others contain just plain text. Is there a way to extract the data from the second cell in both cases (link or plain text)?


Solution

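  • Just add a second forward slash before text():

    //table[@class="info"]//td[2]//text()

    This fetches the text nodes of all descendants of the selected td, so text nested inside an <a> element is included. With a single slash, /text() matches only text nodes that are direct children of the td, which is why "Premier" was skipped.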
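
    To see the same idea in running code, here is a minimal Python sketch using only the standard library. Note the assumptions: xml.etree.ElementTree supports only a subset of XPath and has no text() node test, so instead of evaluating //text() it selects each second td and collects the text of all its descendants with itertext(), which gives the equivalent result. The wrapping <div>, the HTML constant, and the function name second_cell_values are made up for illustration, and the markup must be well-formed XML for this parser. A library with full XPath 1.0 support, such as lxml, should be able to evaluate the //text() expression directly.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <div> (an assumption) so that
# the table is a descendant of the parsed root element.
HTML = """<div>
<table class="info">
<tbody>
    <tr><td class="name">Year</td><td>2011</td></tr>
    <tr><td class="name">Storey</td><td>3</td></tr>
    <tr><td class="name">Title</td><td><a href="http://gov.kz/premera/">Premier</a></td></tr>
    <tr><td class="name">Condition</td><td>Renovated</td></tr>
</tbody>
</table>
</div>"""


def second_cell_values(markup):
    """Return the text of the second <td> in every row of table.info."""
    root = ET.fromstring(markup)
    # Select the second <td> of each row, as in the original expression.
    cells = root.findall(".//table[@class='info']//td[2]")
    # itertext() walks the cell and all of its descendants, so text inside
    # a nested <a> is picked up just like plain text in the cell itself.
    return ["".join(td.itertext()).strip() for td in cells]


values = second_cell_values(HTML)
print(values)  # ['2011', '3', 'Premier', 'Renovated']
```

    Joining the pieces per cell also keeps one value per row, even if a cell mixes plain text and a link.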