python xpath web-crawler

How to select only child elements in XPath, not grandchild elements


I only need to count the tr elements that are direct rows of the outer table, not the tr elements nested inside the inner tables. My current count is 8, but the result I want is 2. I am new to this; how can I solve it?

from lxml import etree

html_string = '''
<!DOCTYPE html>
<html lang="en">
    <head>
        <title>title</title>
    </head>
    <body>
        <div class="books">
            <table width="100%" cellspacing="0" cellpadding="0" border="0">
                <tbody>
                    <tr> <!-- want to count -->
                        <td><p class="en">name:</p>
                        </td>
                        <td>
                            <table width="780" cellspacing="0" cellpadding="0" border="0" class="noComma">
                                <tbody>
                                    <tr>……</tr>
                                    <tr>……</tr>
                                    <tr>……</tr>
                                </tbody>
                            </table>
                        </td>
                    </tr>
                    <tr> <!-- want to count -->
                        <td style="width: 200px" class="left_title">
                            <p class="en">name:</p>
                        </td>
                        <td>
                            <table width="780" cellspacing="0" cellpadding="0" border="0" class="noComma">
                                <tbody>
                                    <tr>……</tr>
                                    <tr>……</tr>
                                    <tr>……</tr>
                                </tbody>
                            </table>
                        </td>
                    </tr>
                </tbody>
            </table>
        </div>
    </body>
</html>
'''

html = etree.HTML(html_string)
trs = html.xpath('//tr')
print(len(trs))

This prints 8, but the result I want is 2.


Solution

  • Use:

    trs = html.xpath('//tr[not(ancestor::td)]')
    

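    That selects only the tr elements that have no td ancestor, i.e. the rows of the outer table. The rows of the inner tables sit inside a td, so they are excluded.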

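    Or be more explicit and spell out the path with the child axis (/) instead of the descendant axis (//):

    trs = html.xpath("//div[@class='books']/table/tbody/tr")

    This works here because the markup contains explicit tbody elements; lxml's HTML parser does not insert tbody on its own the way browsers do.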
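
The two expressions above can be compared side by side. The following is a minimal sketch, assuming a shortened stand-in for the page in the question (two outer rows, each holding a nested table with three rows). The cell contents and the variable names are made up for illustration. It also shows a third option that is not part of the answer above: running a relative child-axis query from the outer table element.

```python
from lxml import etree

# Shortened stand-in for the page in the question (assumed structure):
# an outer table with two rows, each containing a nested three-row table.
nested = (
    "<table><tbody>"
    "<tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr>"
    "</tbody></table>"
)
html_string = (
    '<html><body><div class="books"><table><tbody>'
    + f"<tr><td>name:</td><td>{nested}</td></tr>" * 2
    + "</tbody></table></div></body></html>"
)

html = etree.HTML(html_string)

# '//' is the descendant axis: it matches tr elements at any depth.
all_rows = html.xpath("//tr")

# Keep only rows that are not nested inside a table cell.
outer_by_ancestor = html.xpath("//tr[not(ancestor::td)]")

# Spell out the path using '/' (child axis) at every step.
outer_by_path = html.xpath("//div[@class='books']/table/tbody/tr")

# Relative query: start from the outer table and walk child steps only.
outer_table = html.xpath("//div[@class='books']/table")[0]
outer_by_child_axis = outer_table.xpath("./tbody/tr")

print(len(all_rows), len(outer_by_ancestor),
      len(outer_by_path), len(outer_by_child_axis))
# prints: 8 2 2 2
```

The ancestor predicate is the more tolerant choice when the surrounding markup may change, while the explicit path and the relative query depend on the exact nesting (div, table, tbody) staying as it is.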