
Why does the XPath //tr[1] match two tr elements in lxml instead of only the first?


Here is a simplified view of the structure of target_html:

table--div--tr[id="tr1"]
     |--tr[id="tr2"]
     |--tr[id="tr3"]
     |--tr[id="tr4"]

I want to extract only the first tr from target_html with lxml:

import lxml.html

target_html="""
<table id="t1"> 
<div id="div1"> 
<tr id="tr1"> 
<td>11</td> 
<td>12</td> 
</tr> 
</div> 

<tr id="tr2">
<td>21</td> 
<td>22</td> 
</tr>

<tr id="tr3"> 
<td>31</td> 
<td>32</td> 
</tr> 

<tr id="tr4"> 
<td>41</td> 
<td>42</td> 
</tr> 
</table> """

doc=lxml.html.fromstring(target_html)
for item in doc.xpath('//tr[1]'):
    print(item.text_content())

Expected result:

11 
12 

Actual result:

11 
12     

21 
22 

Why are two tr elements matched by //tr[1]?


Solution

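  • The XPath //tr[1] selects every tr element that is the first tr child of its parent. The [1] predicate is evaluated separately for each parent, not once over the whole document, so each parent that contains a tr contributes one match.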

    The following tr is selected because it's the first tr child of div:

    <tr id="tr1"> 
    <td>11</td> 
    <td>12</td> 
    </tr>
    

    The following tr is selected because it's the first tr child of table:

    <tr id="tr2">
    <td>21</td> 
    <td>22</td> 
    </tr>
    

    To grab only the first tr in the whole document, wrap the path in parentheses so that the index applies to the combined result:

    doc.xpath('(//tr)[1]')
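
The difference between the two expressions can be checked side by side. The following is a minimal sketch, assuming lxml is installed and that its HTML parser keeps the div inside the table, as the output in the question suggests. The names snippet, per_parent and first_overall are made up for this illustration, and the markup is a shortened version of target_html.

```python
import lxml.html

# Same shape as target_html in the question, shortened to one cell per row.
snippet = """
<table id="t1">
<div id="div1">
<tr id="tr1"><td>11</td></tr>
</div>
<tr id="tr2"><td>21</td></tr>
<tr id="tr3"><td>31</td></tr>
</table>
"""

doc = lxml.html.fromstring(snippet)

# //tr[1]: the position test runs once per parent, so every parent
# that has a tr child can contribute its own first tr.
per_parent = [(tr.get("id"), tr.getparent().tag) for tr in doc.xpath("//tr[1]")]

# (//tr)[1]: the parentheses collect all tr elements in document order
# first, and only then does [1] pick the first one overall.
first_overall = [tr.get("id") for tr in doc.xpath("(//tr)[1]")]

print(per_parent)
print(first_overall)
```

Printing the parent tag next to each id shows which element each matched tr is "first" within, which is a quick way to debug a positional predicate that matches more than expected.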