python python-3.x web-scraping beautifulsoup

How to scrape the accurate fields from HTML tables of varying lengths?

I'm trying to create a script to scrape a few fields from tables based on their headers. The problem is that all the tables are not of the same length, along with their headers.

Here are the HTML elements of the two tables with different lengths for your consideration.

from bs4 import BeautifulSoup

html_table1 = """
<thead>
<tr>
    <td rowspan="1" data-sort-key="survey_date_surrkey">&nbsp;</td>
    <th scope="col" colspan="1" data-sort-key="cell-1"><span>2 Bedroom</span></th>
    <th scope="col" colspan="1" data-sort-key="cell-2"><span>3 Bedroom +</span></th>
    <th scope="col" colspan="1" data-sort-key="cell-3"><span>Total</span></th>
</tr>            
</thead>
<tbody>
<tr>
    <th scope="row" class="first-cell" data-field="survey_date_surrkey" data-value="0">2010 October</th>
    <td class="numericalData" data-field="cell-1" data-value="23">23</td>
    <td class="numericalData" data-field="cell-2" data-value="5">5</td>
    <td class="numericalData" data-field="cell-3" data-value="28">28</td>
</tr>
</tbody>
"""

html_table = """
<thead>
<tr>
    <td rowspan="1" data-sort-key="survey_date_surrkey">&nbsp;</td>
    <th scope="col" colspan="2" data-sort-key="cell-1"><span>Bachelor</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-3"><span>1 Bedroom</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-5"><span>2 Bedroom</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-7"><span>3 Bedroom +</span></th>
    <th scope="col" colspan="2" data-sort-key="cell-9"><span>Total</span></th>
</tr>            
</thead>
<tbody>
<tr>
    <th scope="row" class="first-cell" data-field="survey_date_surrkey" data-value="0">1990 October</th>
    <td class="numericalData" data-field="cell-1" data-value="0.0">0.0</td>
    <td class="" data-field="cell-2" data-value="b ">b </td>
    <td class="numericalData" data-field="cell-3" data-value="0.2">0.2</td>
    <td class="" data-field="cell-4" data-value="a ">a </td>
    <td class="numericalData" data-field="cell-5" data-value="0.0">0.0</td>
    <td class="" data-field="cell-6" data-value="b ">b </td>
    <td class="numericalData" data-field="cell-7" data-value="0.0">0.0</td>
    <td class="" data-field="cell-8" data-value="b ">b </td>
    <td class="numericalData" data-field="cell-9" data-value="0.1">0.1</td>
    <td class="" data-field="cell-10" data-value="a ">a </td>
</tr>
</tbody>
"""

soup = BeautifulSoup(html_table1,'html.parser')

headers = [header.text.strip() for header in soup.select('thead th')]
non_class_presence = [item.text.strip() for item in soup.select("tbody tr td[class='']")]
index_num = len(soup.select("td.numericalData"))

try:
    bachelor = soup.select("td.numericalData")[0].text
except IndexError: bachelor = ""
try:
    one_bed = soup.select("td.numericalData")[1].text
except IndexError: one_bed = ""
try:
    two_bed = soup.select("td.numericalData")[2].text
except IndexError: two_bed = ""
try:
    three_bed = soup.select("td.numericalData")[3].text
except IndexError: three_bed = ""
try:
    total_bed = soup.select("td.numericalData")[4].text
except IndexError: total_bed = ""
print(f"bachelor: {bachelor}\none_bed: {one_bed}\ntwo_bed: {two_bed}\nthree_bed: {three_bed}\ntotal_bed: {total_bed}")

When I run the script using the second html_table, I get the following output: (correct result)

bachelor: 0.0
one_bed: 0.2
two_bed: 0.0
three_bed: 0.0
total_bed: 0.1

However, the script produces errorneous results when I run it using html_table1: (wrong result)

bachelor: 23
one_bed: 5
two_bed: 28
three_bed: 
total_bed:

Correct output for html_table1:

bachelor: 
one_bed: 
two_bed: 23
three_bed: 5
total_bed: 28

How can I change the script's logic such that, despite the table's and its headers' length, the results are accurately parsed?

Solution

You are close to a propper result, just select your numeric values more precisely, similar to what you did for the non_class_presence and zip() the results:

headers = [header.get_text(strip=True) for header in soup.select('thead th')]
numeric_values = [item.get_text(strip=True) for item in soup.select("tbody tr td.numericalData")]

dict(zip(headers,numeric_values))

I replaced .text.strip() what is not wrong with .get_text(strip=True) because it is closer to the docs

Result for html_table

{'Bachelor': '0.0',
 '1 Bedroom': '0.2',
 '2 Bedroom': '0.0',
 '3 Bedroom +': '0.0',
 'Total': '0.1'}

Result for html_table1

{'2 Bedroom': '23', '3 Bedroom +': '5', 'Total': '28'}

An alternative and at least my first choice when scraping table data is the use of pandas.read_html (note, however, that the sample data should contain valid HTML and the <table> should be supplemented)

import pandas as pd
df = pd.read_html(html_table)[0]
df[[c for c in df.columns if not any(x in c for x in ['.',':'])]].to_json(orient='records')