Search code examples
pythonpandasweb-scrapingbeautifulsouppython-requests

Unable to create similar column headers using list comprehension as pandas does for a particular table


I'm trying to scrape headers of a table from a webpage using list comprehension. The problem I'm facing is that when I create the same headers using pandas, the appearance is vastly different. Just to inform you, the headers created by padas address all columns.

from bs4 import BeautifulSoup
import pandas as pd
import requests

link = 'https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/TableMatchingCriteria?GeographyType=SurveyZone&GeographyId=011002&CategoryLevel1=Primary%20Rental%20Market&CategoryLevel2=Vacancy%20Rate%20%28%25%29&ColumnField=2&RowField=TIMESERIES'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")

col_headers = [item.text.replace("\xa0","Unnamed: 0") for i in soup.select("table.CawdDataTable thead tr") for item in i.select("th, td")]
print(col_headers)

selector = soup.select_one("table.CawdDataTable")
df = pd.read_html(str(selector))[0]
print(list(df.columns))

Output: The first one is created by list comprehension, while the second one is created by pandas.

['Unnamed: 0', 'Bachelor', '1 Bedroom', '2 Bedroom', '3 Bedroom +', 'Total']
['Unnamed: 0', 'Bachelor', 'Bachelor.1', '1 Bedroom', '1 Bedroom.1', '2 Bedroom', '2 Bedroom.1', '3 Bedroom +', '3 Bedroom +.1', 'Total', 'Total.1']

Can I create similar column headers using list comprehension as pandas does?


Solution

  • IIUC you can do:

    from io import StringIO
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    link = "https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/TableMatchingCriteria?GeographyType=SurveyZone&GeographyId=011002&CategoryLevel1=Primary%20Rental%20Market&CategoryLevel2=Vacancy%20Rate%20%28%25%29&ColumnField=2&RowField=TIMESERIES"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }
    
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    
    table = soup.table
    
    # remove colspan="2" from the <th>
    for th in table.select("th[colspan]"):
        del th.attrs["colspan"]
    
    # join every two <td>
    for tr in table.select("tbody tr"):
        tds = tr.select("td")
    
        for i in range(0, len(tds), 2):
            td1, td2 = tds[i], tds[i + 1]
            td1.insert_before(
                BeautifulSoup(f"<td>{td1.text} {td2.text}</td>", "html.parser")
            )
            td1.extract()
            td2.extract()
    
    df = pd.read_html(StringIO(str(table)))[0]
    df = df.set_index(df.columns[0]).rename_axis(index=None)
    print(df)
    

    Prints:

                 Bachelor 1 Bedroom 2 Bedroom 3 Bedroom +   Total
    2010 October   13.3 a     9.6 a     6.2 a          **  11.0 a
    2011 October   13.3 a    14.9 a     8.7 a          **  11.1 a
    2012 October   13.3 a     8.0 a     6.4 a          **   7.9 a
    2013 October    6.7 a     5.1 b     4.7 a          **   4.0 a
    2014 October    0.0 a     5.8 a     4.2 a          **   3.8 a
    2015 October       **     4.0 b     2.7 b          **   2.6 a
    2016 October    0.0 a     1.8 b     0.8 a          **   1.4 a
    2017 October    0.0 a     1.5 a     0.0 d          **   0.5 a
    2018 October       **     3.4 d     0.6 a          **   1.6 b
    2019 October    0.0 a     0.0 c     0.5 a          **   0.2 a
    2020 October       **     1.4 d     0.7 a          **   0.8 a
    2021 October       **        **     2.5 b          **   2.0 b
    2022 October       **     5.0 c    10.8 d          **   7.3 a
    2023 October       **        **        **          **      **