How to parse an HTML table in Python


I'm new to parsing tables and regular expressions. Can you help me parse this in Python?

<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>

I need to extract "3text" and "6text", i.e. the second cell of each row.


Solution

  • You can use BeautifulSoup's CSS selector methods select() and select_one() to get "3text" and "6text", like below:

    from bs4 import BeautifulSoup
    html_doc='''
    <table callspacing="0" cellpadding="0">
        <tbody><tr>
        <td>1text&nbsp;2text</td>
        <td>3text&nbsp;</td>
        </tr>
        <tr>
        <td>4text&nbsp;5text</td>
        <td>6text&nbsp;</td>
        </tr>
    </tbody></table>
    '''
    
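    # The 'lxml' parser needs the lxml package; the built-in 'html.parser' also works here.
    soup = BeautifulSoup(html_doc, 'lxml')
    rows = soup.select('tr')
    
    for row in rows:
        # td:nth-child(2) matches the second cell of each row
        print(row.select_one('td:nth-child(2)').text)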
    

    You can also use the find_all() method:

    trs = soup.find('table').find_all('tr')
    
    for tr in trs:
        tds = tr.find_all('td')
        # index 1 is the second cell of the row
        print(tds[1].text)
    

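    Result (each value keeps the trailing non-breaking space that came from &nbsp;; call .strip() on the text to remove it):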

    3text 
    6text
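
If you need the bare strings "3text" and "6text" rather than printed lines, the same idea can be wrapped in a small helper. This is only a sketch: the function name second_column_text is made up for illustration, it assumes every value you want sits in the second cell of its row, and it uses the standard-library 'html.parser' backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''


def second_column_text(html):
    """Return the stripped text of the second <td> in every <tr> (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    values = []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        # Skip rows that do not have a second cell instead of raising IndexError.
        if len(cells) < 2:
            continue
        # strip=True trims surrounding whitespace, including the
        # non-breaking space that &nbsp; is decoded to.
        values.append(cells[1].get_text(strip=True))
    return values


print(second_column_text(html_doc))
```

Returning a list instead of printing inside the loop makes the values easy to reuse, for example to write them to a file or compare them against expected data.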