python web-scraping beautifulsoup html-table python-requests

How to parse an HTML table with a fixed shape?

I receive an HTML table that always has the same shape. Only the values differ each time.

html = '''
<table align="center">
    <tr>
        <th>Name</th>
        <td>NAME A</td>
        <th>Status</th>
        <td class="IN PROGRESS">IN PROGRESS</td>
    </tr>
    <tr>
        <th>Category</th>
        <td COLSPAN="3">CATEGORY A</td>
    </tr>
    <tr>
        <th>Creation date</th>
        <td>13/01/23 23:00</td>
        <th>End date</th>
        <td></td>
    </tr>
</table>
'''

I need to convert it to a dataframe, but pandas gives me a layout where the labels and values are mixed across the columns:

print(pd.read_html(html)[0])

               0               1           2            3
0           Name          NAME A      Status  IN PROGRESS
1       Category      CATEGORY A  CATEGORY A   CATEGORY A
2  Creation date  13/01/23 23:00    End date          NaN

I feel like I need to use BeautifulSoup, but I'm not sure how:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

Can you help me with that?

My expected output is this dataframe:

     Name    Category   Status     Creation date  End date
0  NAME A  CATEGORY A  RUNNING  27/07/2023 11:43       NaN

Solution

Based on your example, you could iterate over the <td> elements, store each one's text in a dict under the text of its previous sibling <th>, and create your dataframe from that dict:

{e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}

Example

from bs4 import BeautifulSoup
import pandas as pd

html = '''
<table align="center">
    <tr>
        <th>Name</th>
        <td>NAME A</td>
        <th>Status</th>
        <td class="IN PROGRESS">IN PROGRESS</td>
    </tr>
    <tr>
        <th>Category</th>
        <td COLSPAN="3">CATEGORY A</td>
    </tr>
    <tr>
        <th>Creation date</th>
        <td>13/01/23 23:00</td>
        <th>End date</th>
        <td></td>
    </tr>
</table>
'''

# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(html, 'html.parser')

pd.DataFrame(
    [
        {e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}
    ]
)

Result

     Name       Status    Category   Creation date  End date
0  NAME A  IN PROGRESS  CATEGORY A  13/01/23 23:00
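
Note that the empty End date cell comes back as an empty string rather than NaN, and the columns follow the order in which the cells appear in the HTML. If you want something closer to the expected output in the question, a minimal sketch could look like the one below. It rests on two assumptions of mine: that an empty cell should count as a missing value, and that the column order should be the one shown in the question. The helper name cell_text is made up for this example.

```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

html = '''
<table align="center">
    <tr>
        <th>Name</th>
        <td>NAME A</td>
        <th>Status</th>
        <td class="IN PROGRESS">IN PROGRESS</td>
    </tr>
    <tr>
        <th>Category</th>
        <td COLSPAN="3">CATEGORY A</td>
    </tr>
    <tr>
        <th>Creation date</th>
        <td>13/01/23 23:00</td>
        <th>End date</th>
        <td></td>
    </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')


def cell_text(td):
    """Hypothetical helper: stripped cell text, or NaN for an empty cell."""
    text = td.get_text(strip=True)
    # Assumption: an empty <td> means "no value"
    return text if text else np.nan


# Pair every <td> with the <th> that precedes it in the same row
record = {
    td.find_previous_sibling('th').get_text(strip=True): cell_text(td)
    for td in soup.select('table td')
}

df = pd.DataFrame([record])

# Assumption: column order copied from the expected output in the question
df = df[['Name', 'Category', 'Status', 'Creation date', 'End date']]

print(df)
```

This still depends on every <td> having a <th> before it in the same row, which holds for the table in the question; a row with a <td> and no label would raise an AttributeError.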