Tags: pandas, dataframe, web-scraping, beautifulsoup, html-parsing

How to scrape specific columns from table with BeautifulSoup and return as pandas dataframe


I'm trying to parse the HDI table and load the data into a pandas DataFrame with the columns Country and HDI_score.

I'm stuck on loading the Nation column with the following code:

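import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
# Assumption: `table` was not defined in the original snippet; the first wikitable is taken here
table = bsObj.find('table', class_='wikitable')

rows = []
for row in table.find_all('tr'):
    columns = row.find_all('td')  # note: <th> cells are not included

    if len(columns) > 2:
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        rows.append({'Countries': countries, 'HDI_score': hdi_score})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns=['Countries', 'HDI_score'])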

Result from my code

So instead of country names, I get the values from the 'Rank changes over 5 years' column. I've tried changing the column index, but it didn't help.


Solution

  • You could use pandas to grab the appropriate table (match='Rank' selects the right one), then extract the columns of interest.

    import pandas as pd
    
    table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
    columns = ['Nation','HDI']
    table = table.loc[:, columns].iloc[:, :2]
    table.columns = columns
    print(table)
    

    As requested in the comments, here is a version that uses bs4 to select the table, although I see little point in involving bs4 if you are still using pandas:

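    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup as bs

    r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
    soup = bs(r.content, 'lxml')
    # :-soup-contains replaces the deprecated :contains pseudo-class in soupsieve
    table_html = str(soup.select_one('table:has(th:-soup-contains("Rank"))'))
    # Wrap the HTML in StringIO; newer pandas versions deprecate passing literal HTML strings
    table = pd.read_html(StringIO(table_html))[0]
    columns = ['Nation','HDI']
    table = table.loc[:, columns].iloc[:, :2]
    table.columns = columns
    print(table)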
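
If you would rather stay with BeautifulSoup for the row loop, one possible explanation for the shifted columns is that the nation name sits in a <th> cell, which row.find_all('td') skips, so every index after it moves by one. The sketch below is only an illustration under that assumption: SAMPLE_HTML is a made-up, simplified stand-in for the Wikipedia table (placeholder countries and scores, not real data), and parse_hdi_table is a hypothetical helper name. It collects both cell types so the indices match the visible column order.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample table (assumption): the nation is a <th> cell,
# all other values are <td> cells. Names and scores are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>0</td><th>Country A</th><td>0.9</td></tr>
  <tr><td>2</td><td>1</td><th>Country B</th><td>0.8</td></tr>
</table>
"""


def parse_hdi_table(html):
    """Return a DataFrame with Country and HDI_score columns (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Rows without any <td> are header rows; skip them.
        if not tr.find("td"):
            continue
        # Collect <th> and <td> together, in document order, so that
        # index 2 is the nation and index 3 is the HDI value.
        cells = tr.find_all(["th", "td"])
        rows.append({
            "Country": cells[2].get_text(strip=True),
            "HDI_score": float(cells[3].get_text(strip=True)),
        })
    # Build the frame once from the collected rows.
    return pd.DataFrame(rows, columns=["Country", "HDI_score"])


df = parse_hdi_table(SAMPLE_HTML)
print(df)
```

The cell indices (2 and 3) match the sample layout only; against the live page you would need to inspect the actual row structure and adjust them.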