python pandas web-scraping beautifulsoup jupyter-notebook

How to account for colspan in a table header while web scraping


I am new to web scraping and am trying to scrape the 2022 Forbes table from https://en.wikipedia.org/wiki/List_of_largest_companies_in_India. The Rank and Forbes rank header cells both have a colspan of 2, so the table has only 9 headers while each data row has 11 cells. When I try to insert each row under its corresponding headers, I get an error (cannot set a row with mismatched columns).

So how do I account for the colspan of the Rank and Forbes rank columns?

Here is my code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_India'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')

# First table on the page (the 2022 Forbes list)
table = soup.find('table')
titles = table.find_all('th')
Table_Title = [title.text.strip() for title in titles]

# Only 9 column names, because two headers span 2 columns each
df = pd.DataFrame(columns=Table_Title)

column_data = table.find_all('tr')

for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    # Fails here: each row has 11 cells but df has only 9 columns
    df.loc[length] = individual_row_data
    print(individual_row_data)

Solution

  • Since you want to create a DataFrame and are already using pandas, the most elegant solution is to scrape the table with [pandas.read_html()][1], which expands each colspan header into separate columns:

    pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_in_India')[0]
    
         Rank  Rank.1  Forbes 2000 rank  Forbes 2000 rank.1  Name                 Headquarters  Revenue (billions US$)  Profit (billions US$)  Assets (billions US$)  Value (billions US$)  Industry
    0    1     (0)     54                (+1)                Reliance Industries  Mumbai        86.85                   7.81                   192.59                 228.63                Conglomerate
    ...
    50   51    (0)     1759              (+208)              DMart                Mumbai        4                       0.20                   1.93                   34.12                 Retail
    51   52    (0)     1759              (+208)              Adyar Ananda Bhavan  Chennai       4                       0.20                   1.93                   34.12                 Retail
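
To check how read_html treats a colspan header without hitting the network, here is a minimal sketch on a small inline table. The sample HTML is a made-up stand-in for the Wikipedia markup (its two rows are borrowed from the output above), and it assumes a parser backend for read_html such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup: one header spans two columns, like "Rank" on the real page.
html = """
<table>
  <tr><th colspan="2">Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>(0)</td><td>Reliance Industries</td></tr>
  <tr><td>51</td><td>(0)</td><td>DMart</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
df = pd.read_html(StringIO(html))[0]

# The spanned header is repeated and the duplicate gets a numeric suffix.
print(df.columns.tolist())
print(df)
```

The duplicated header should come out as Rank and Rank.1, the same naming seen in the output above, so every data cell has a column to land in.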

    Alternatively, you could keep your BeautifulSoup selection and repeat each title as many times as its colspan value:

    Table_Title = []
    
    for title in table.find_all('th'):
        if title.get('colspan'):
            Table_Title.extend([title.get_text(strip=True)]*int(title.get('colspan')))
        else:
            Table_Title.append(title.get_text(strip=True))

  [1]: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
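
Putting the BeautifulSoup variant together end to end, a sketch under a few assumptions: the inline HTML is a hypothetical stand-in for the Wikipedia table (rows taken from the output above), the header cells all sit in the first row, and the ".1" suffix for repeated names is just one possible naming choice that mirrors what pandas does.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup standing in for the Wikipedia table.
html = """
<table>
  <tr><th colspan="2">Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>(0)</td><td>Reliance Industries</td></tr>
  <tr><td>51</td><td>(0)</td><td>DMart</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Repeat each header once per column it spans (colspan defaults to 1).
titles = []
for th in rows[0].find_all("th"):
    span = int(th.get("colspan", 1))
    titles.extend([th.get_text(strip=True)] * span)

# Make repeated names unique: "Rank", "Rank.1", ...
seen = {}
columns = []
for name in titles:
    count = seen.get(name, 0)
    columns.append(name if count == 0 else f"{name}.{count}")
    seen[name] = count + 1

# Collect the cell text of every data row and build the frame in one go.
data = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]
df = pd.DataFrame(data, columns=columns)
print(df)
```

Building the DataFrame once from a list of rows also avoids growing it row by row with df.loc, and the unique column names keep later column lookups unambiguous.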