Here is a simplified structure of target_html:
table--div--tr[id="tr1"]
|--tr[id="tr2"]
|--tr[id="tr3"]
|--tr[id="tr4"]
I want to extract the first tr from target_html with lxml:
target_html="""
<table id="t1">
<div id="div1">
<tr id="tr1">
<td>11</td>
<td>12</td>
</tr>
</div>
<tr id="tr2">
<td>21</td>
<td>22</td>
</tr>
<tr id="tr3">
<td>31</td>
<td>32</td>
</tr>
<tr id="tr4">
<td>41</td>
<td>42</td>
</tr>
</table> """
import lxml.html

doc = lxml.html.fromstring(target_html)
for item in doc.xpath('//tr[1]'):
    print(item.text_content())
Expected result parsed by lxml:
11
12
Actual result parsed by lxml:
11
12
21
22
Why were two tr elements matched by tr[1]?
The XPath //tr[1] selects every tr element that is the first tr child of its own parent, so it can match once per parent element rather than once per document.
The following tr is selected because it is the first tr child of div:
<tr id="tr1">
<td>11</td>
<td>12</td>
</tr>
The following tr is selected because it is the first tr child of table:
<tr id="tr2">
<td>21</td>
<td>22</td>
</tr>
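One way to check this is to print the parent of each match. The sketch below is only an illustration: it parses the same markup as XML with lxml.etree (an assumption made here so that the tree is exactly the one drawn in the question), and the names xml, matches and longhand are made up for the example. It also spells out //tr[1] in its unabbreviated form, where the predicate clearly applies to the child step.

```python
from lxml import etree

# Same structure as target_html, parsed as XML so the nesting is kept as written.
xml = """<table id="t1">
  <div id="div1">
    <tr id="tr1"><td>11</td><td>12</td></tr>
  </div>
  <tr id="tr2"><td>21</td><td>22</td></tr>
  <tr id="tr3"><td>31</td><td>32</td></tr>
  <tr id="tr4"><td>41</td><td>42</td></tr>
</table>"""

doc = etree.fromstring(xml)

# Pair each matched tr with the tag of its parent element.
matches = [(tr.get("id"), tr.getparent().tag) for tr in doc.xpath("//tr[1]")]
print(matches)

# //tr[1] is shorthand for this path: [1] filters the tr children of each node.
longhand = doc.xpath("/descendant-or-self::node()/child::tr[position() = 1]")
print([tr.get("id") for tr in longhand])
```

Each match has a different parent, which is why both rows are returned.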
To grab only the first occurrence in the whole document, wrap the path in parentheses so that [1] applies to the complete result set:
doc.xpath('(//tr)[1]')
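A minimal sketch of the parenthesized form, again assuming the markup is parsed as XML with lxml.etree so the tree matches the diagram (the variable names are illustrative). It also shows /descendant::tr[1], an equivalent spelling in which the predicate counts positions along the descendant axis instead of per parent.

```python
from lxml import etree

xml = """<table id="t1">
  <div id="div1">
    <tr id="tr1"><td>11</td><td>12</td></tr>
  </div>
  <tr id="tr2"><td>21</td><td>22</td></tr>
  <tr id="tr3"><td>31</td><td>32</td></tr>
  <tr id="tr4"><td>41</td><td>42</td></tr>
</table>"""

doc = etree.fromstring(xml)

# Parentheses collect every tr in document order first; [1] then picks one node.
first = doc.xpath("(//tr)[1]")
first_ids = [tr.get("id") for tr in first]
print(first_ids)

# Cell texts of that single row.
cells = first[0].xpath("td/text()")
print(cells)

# Alternative spelling: position is counted over all descendants of the root.
alt_ids = [tr.get("id") for tr in doc.xpath("/descendant::tr[1]")]
print(alt_ids)
```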