Scraping a nested and unstructured table in Python (lxml)


I'm scraping a website with lxml, and everything works fine except for one table, in which the tr, td and heading th elements are nested and mixed together, forming an unstructured HTML table.

<table class='table'>
    <tr>
        <th>Serial No.
            <th>Full Name
                <tr>
                    <td>1
                        <td rowspan='1'> John 
                            <tr>
                                <td>2
                                    <td rowspan='1'>Jane Alleman
                                        <tr>
                                            <td>3
                                                <td rowspan='1'>Mukul Jha
                                                 .....
                                                 .....
                                                 .....
</table>

I tried the following XPaths, but each of them just gives me back an empty list.

persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]

persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]

persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if not x.isdigit()]  # to remove the serial numbers

Finally, what is the reason for such nesting? Is it meant to prevent scraping?


Solution

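  • lxml's HTML parser appears to repair the markup much as a browser does: it builds a correctly structured tree in memory, which you can confirm by printing lxml.html.tostring(table). The elements are not really nested; the source only omits closing tags (HTML allows the closing tags of tr, td and th to be left out), so the nesting is more likely sloppy markup than an anti-scraping measure.

    Because the parsed table is well formed, the ordinary './tr/td//text()' is enough to get all the values. The XPaths in the question return an empty list because they describe the nesting suggested by the indentation, which does not exist in the parsed tree.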

    import requests
    import lxml.html
    
    text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
    
    s = lxml.html.fromstring(text)
    
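    # the second table on the page is the one shown in the result below
    table = s.xpath('//table')[1]
    
    # after parsing, rows are direct children of the table and cells of the rows
    for row in table.xpath('./tr'):
        cells = row.xpath('./td//text()')
        print(cells)
    
    # print the repaired HTML that lxml built in memory
    print(lxml.html.tostring(table, pretty_print=True).decode())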
    

    Result

    ['Fare', ' DMRC Rs. 30']
    ['Time', '0:14']
    ['First', '6:03']
    ['Last', '22:24']
    ['Phone ', '8800793196']
    
    <table class="table">
    <tr>
    <td title="Monday To Saturday">Fare</td>
    <td><div> DMRC Rs. 30</div></td>
    </tr>
    <tr>
    <td>Time</td>
    <td>0:14</td>
    </tr>
    <tr>
    <td>First</td>
    <td>6:03</td>
    </tr>
    <tr>
    <td>Last</td>
    <td>22:24</td>
    </tr>
    <tr>
    <td>Phone </td>
    <td><a href="tel:8800793196">8800793196</a></td>
    </tr>
    </table>
    

    Original HTML for comparison; note the missing closing tags

    <table class='table'>
    <tr><td  title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
    <tr><td>Time<td>0:14</tr>
    <tr><td>First<td>6:03</tr>
    <tr><td>Last<td>22:24
    <tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
    </table>
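
    The same idea can be tried on the table from the question. This is only a sketch: the HTML string below is a hand-written reconstruction of the question's snippet (the elided rows are left out, and the indentation is dropped because it has no effect on parsing), and the variable names `names` and `nested` are made up for illustration. It reads the second cell of every data row, and also runs one of the question's XPaths to show that it finds nothing in the parsed tree.

```python
import lxml.html

# Reconstruction of the question's table: no closing tags for tr, th or td.
html = """
<table class='table'>
<tr><th>Serial No.<th>Full Name
<tr><td>1<td rowspan='1'> John
<tr><td>2<td rowspan='1'>Jane Alleman
<tr><td>3<td rowspan='1'>Mukul Jha
</table>
"""

# document_fromstring always returns the root <html> element,
# so an absolute XPath such as //table works as expected.
doc = lxml.html.document_fromstring(html)

names = []
# './/tr' style matching (here '//tr' under the table) works whether or not
# a parser places the rows inside a tbody element.
for row in doc.xpath('//table[@class="table"]//tr'):
    cells = [td.text_content().strip() for td in row.xpath('./td')]
    # The header row has only th cells, so its list is empty and is skipped.
    if len(cells) == 2:
        names.append(cells[1])

print(names)

# The XPath from the question assumes th/td elements nested inside each other,
# which is not how the parsed tree looks, so it matches nothing.
nested = doc.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')
print(nested)
```

    Taking the second cell by position avoids the isdigit() filter from the question, which would also drop a name that happened to consist only of digits.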