I am new to web scraping and am trying to scrape the 2022 Forbes table from https://en.wikipedia.org/wiki/List_of_largest_companies_in_India. The Rank and Forbes Rank header cells both have colspan=2, so the table has 9 header cells but each row has 11 data cells. When I try to insert each row under its corresponding headers, I get an error (cannot set a row with mismatched columns).
How do I account for the colspan on Rank and Forbes Rank?
Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_India'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html')
soup.find('table')
table = soup.find('table')
titles = table.find_all('th')
Table_Title = [title.text.strip() for title in titles]
import pandas as pd
df = pd.DataFrame(columns = Table_Title)
df
column_data = table.find_all('tr')
column_data
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data
print(individual_row_data)
Since you want to create a DataFrame and have already imported `pandas`, the most elegant solution would be to use [`pandas.read_html()`][1] to scrape the table data:
pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_in_India')[0]
| | Rank | Rank.1 | Forbes 2000 rank | Forbes 2000 rank.1 | Name | Headquarters | Revenue (billions US$) | Profit (billions US$) | Assets (billions US$) | Value (billions US$) | Industry |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | (0) | 54 | (+1) | Reliance Industries | Mumbai | 86.85 | 7.81 | 192.59 | 228.63 | Conglomerate |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50 | 51 | (0) | 1759 | (+208) | DMart | Mumbai | 4 | 0.20 | 1.93 | 34.12 | Retail |
| 51 | 52 | (0) | 1759 | (+208) | Adyar Ananda Bhavan | Chennai | 4 | 0.20 | 1.93 | 34.12 | Retail |
Alternatively, you could keep your `BeautifulSoup` selection and repeat each title as many times as the value of its `colspan` attribute:
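Table_Title = []
for title in table.find_all('th'):
    # Repeat the header text once per spanned column so it lines up with the data cells
    if title.get('colspan'):
        Table_Title.extend([title.get_text(strip=True)] * int(title.get('colspan')))
    else:
        Table_Title.append(title.get_text(strip=True))

[1]: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html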
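
One thing to watch with the BeautifulSoup route: repeating a title once per colspan gives you duplicate column names (two "Rank" columns), whereas the `read_html` output above labels them Rank and Rank.1. Here is a minimal sketch of how the names could be made unique in that same style. It uses only the standard library, and `expand_headers`, `make_unique` and the sample `cells` list are hypothetical names of mine for illustration, not part of pandas or BeautifulSoup:

```python
from collections import Counter


def expand_headers(cells):
    """Repeat each header text once per column it spans.

    `cells` is a list of (text, colspan) pairs, which you could build from
    the <th> tags, e.g. (th.get_text(strip=True), int(th.get('colspan', 1))).
    """
    expanded = []
    for text, colspan in cells:
        expanded.extend([text] * colspan)
    return expanded


def make_unique(names):
    """Suffix repeated names with .1, .2, ... so every column name is distinct."""
    seen = Counter()
    unique = []
    for name in names:
        count = seen[name]
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return unique


# Assumed sample: the first few header cells of the table, with their colspans.
cells = [("Rank", 2), ("Forbes 2000 rank", 2), ("Name", 1), ("Headquarters", 1)]
columns = make_unique(expand_headers(cells))
print(columns)
```

You could then pass a list like `columns` to `pd.DataFrame(columns=...)` in place of `Table_Title`, so each of the 11 data cells in a row has its own distinct header.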