I'm trying to scrape headers of a table from a webpage using list comprehension. The problem I'm facing is that when I create the same headers using pandas, the appearance is vastly different. Just to inform you, the headers created by padas address all columns
.
from bs4 import BeautifulSoup
import pandas as pd
import requests
link = 'https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/TableMatchingCriteria?GeographyType=SurveyZone&GeographyId=011002&CategoryLevel1=Primary%20Rental%20Market&CategoryLevel2=Vacancy%20Rate%20%28%25%29&ColumnField=2&RowField=TIMESERIES'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
col_headers = [item.text.replace("\xa0","Unnamed: 0") for i in soup.select("table.CawdDataTable thead tr") for item in i.select("th, td")]
print(col_headers)
selector = soup.select_one("table.CawdDataTable")
df = pd.read_html(str(selector))[0]
print(list(df.columns))
Output: The first one is created by list comprehension, while the second one is created by pandas.
['Unnamed: 0', 'Bachelor', '1 Bedroom', '2 Bedroom', '3 Bedroom +', 'Total']
['Unnamed: 0', 'Bachelor', 'Bachelor.1', '1 Bedroom', '1 Bedroom.1', '2 Bedroom', '2 Bedroom.1', '3 Bedroom +', '3 Bedroom +.1', 'Total', 'Total.1']
Can I create similar column headers using list comprehension as pandas does?
IIUC you can do:
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
link = "https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/TableMatchingCriteria?GeographyType=SurveyZone&GeographyId=011002&CategoryLevel1=Primary%20Rental%20Market&CategoryLevel2=Vacancy%20Rate%20%28%25%29&ColumnField=2&RowField=TIMESERIES"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
table = soup.table
# remove colspan="2" from the <th>
for th in table.select("th[colspan]"):
del th.attrs["colspan"]
# join every two <td>
for tr in table.select("tbody tr"):
tds = tr.select("td")
for i in range(0, len(tds), 2):
td1, td2 = tds[i], tds[i + 1]
td1.insert_before(
BeautifulSoup(f"<td>{td1.text} {td2.text}</td>", "html.parser")
)
td1.extract()
td2.extract()
df = pd.read_html(StringIO(str(table)))[0]
df = df.set_index(df.columns[0]).rename_axis(index=None)
print(df)
Prints:
Bachelor 1 Bedroom 2 Bedroom 3 Bedroom + Total
2010 October 13.3 a 9.6 a 6.2 a ** 11.0 a
2011 October 13.3 a 14.9 a 8.7 a ** 11.1 a
2012 October 13.3 a 8.0 a 6.4 a ** 7.9 a
2013 October 6.7 a 5.1 b 4.7 a ** 4.0 a
2014 October 0.0 a 5.8 a 4.2 a ** 3.8 a
2015 October ** 4.0 b 2.7 b ** 2.6 a
2016 October 0.0 a 1.8 b 0.8 a ** 1.4 a
2017 October 0.0 a 1.5 a 0.0 d ** 0.5 a
2018 October ** 3.4 d 0.6 a ** 1.6 b
2019 October 0.0 a 0.0 c 0.5 a ** 0.2 a
2020 October ** 1.4 d 0.7 a ** 0.8 a
2021 October ** ** 2.5 b ** 2.0 b
2022 October ** 5.0 c 10.8 d ** 7.3 a
2023 October ** ** ** ** **