python web-scraping beautifulsoup html-table python-requests

How to parse an HTML table with a fixed shape?


I receive an HTML table that always has the same shape. Only the values differ each time.

html = '''
<table align="center">
    <tr>
        <th>Name</th>
        <td>NAME A</td>
        <th>Status</th>
        <td class="IN PROGRESS">IN PROGRESS</td>
    </tr>
    <tr>
        <th>Category</th>
        <td COLSPAN="3">CATEGORY A</td>
    </tr>
    <tr>
        <th>Creation date</th>
        <td>13/01/23 23:00</td>
        <th>End date</th>
        <td></td>
    </tr>
</table>
'''

I need to convert it to a DataFrame, but pandas gives me a weird format:

import pandas as pd

print(pd.read_html(html)[0])

               0               1           2            3
0           Name          NAME A      Status  IN PROGRESS
1       Category      CATEGORY A  CATEGORY A   CATEGORY A
2  Creation date  13/01/23 23:00    End date          NaN

I feel like we need to use BeautifulSoup, but I'm not sure how:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

Can you help me with that?

My expected output is this DataFrame:

     Name    Category   Status     Creation date  End date
0  NAME A  CATEGORY A  RUNNING  27/07/2023 11:43       NaN

Solution

  • Based on your example, you could iterate over the <td> elements, pair each one's text with the text of its preceding sibling <th> in a dict, and create your DataFrame from that dict:

    {e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}
    
    Example
    from bs4 import BeautifulSoup
    import pandas as pd
    
    html = '''
    <table align="center">
        <tr>
            <th>Name</th>
            <td>NAME A</td>
            <th>Status</th>
            <td class="IN PROGRESS">IN PROGRESS</td>
        </tr>
        <tr>
            <th>Category</th>
            <td COLSPAN="3">CATEGORY A</td>
        </tr>
        <tr>
            <th>Creation date</th>
            <td>13/01/23 23:00</td>
            <th>End date</th>
            <td></td>
        </tr>
    </table>
    '''
    
    # name the parser explicitly; without it bs4 guesses one and emits a warning
    soup = BeautifulSoup(html, 'html.parser')
    
    pd.DataFrame(
        [
            {e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}
        ]
    )
    
    Result
         Name       Status    Category   Creation date  End date
    0  NAME A  IN PROGRESS  CATEGORY A  13/01/23 23:00
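
Note that the result above differs from the expected output in two ways: the columns come out in document order, and the empty End date cell is an empty string rather than NaN. If that matters, the same idea can be wrapped in a small function. This is only a sketch: table_to_frame and COLUMNS are hypothetical names, the column order is copied from the expected output in the question, and it assumes every <td> has its label <th> directly before it in the same row.

```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

html = '''
<table align="center">
    <tr><th>Name</th><td>NAME A</td><th>Status</th><td class="IN PROGRESS">IN PROGRESS</td></tr>
    <tr><th>Category</th><td COLSPAN="3">CATEGORY A</td></tr>
    <tr><th>Creation date</th><td>13/01/23 23:00</td><th>End date</th><td></td></tr>
</table>
'''

# Assumed column order, taken from the expected output in the question.
COLUMNS = ['Name', 'Category', 'Status', 'Creation date', 'End date']


def table_to_frame(html, columns=COLUMNS):
    """Hypothetical helper: build a one-row DataFrame from a th/td key-value table."""
    soup = BeautifulSoup(html, 'html.parser')
    record = {}
    for td in soup.select('table td'):
        th = td.find_previous_sibling('th')
        if th is None:
            # no label in front of this cell, so there is no column name for it
            continue
        value = td.get_text(strip=True)
        # store empty cells as NaN instead of an empty string
        record[th.get_text(strip=True)] = value if value else np.nan
    # reindex fixes the column order and fills any missing label with NaN
    return pd.DataFrame([record]).reindex(columns=columns)


df = table_to_frame(html)
print(df)
```

Using reindex instead of plain column selection means a label that is missing from one particular table shows up as NaN rather than raising a KeyError, which seems a reasonable default when the shape is fixed but you do not fully trust the source.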