pandas dataframe web-scraping beautifulsoup html-parsing

How to scrape specific columns from a table with BeautifulSoup and return them as a pandas DataFrame

I'm trying to parse the HDI table and load its data into a pandas DataFrame with two columns: Country and HDI_score.

I'm stuck on loading the Nation column with the following code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
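bsObj = BeautifulSoup(html.text, 'html.parser')
# Assumption: the snippet never defines `table`; the first "wikitable" on the page is used here.
table = bsObj.find('table', class_='wikitable')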

df = pd.DataFrame(columns=['Countries', 'HDI_score'])
for row in table.find_all('tr'):    
    columns = row.find_all('td')
    
    if(columns != []):
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        df = df.append({'Countries': countries, 'HDI_score': hdi_score}, ignore_index=True)

Result from my code

Instead of the country names, I get the values from the 'Rank changes over 5 years' column. I've tried changing the column index, but it didn't help.

Solution

You could let pandas grab the appropriate table: passing match='Rank' to read_html returns only the tables that contain that text, and you can then extract the columns of interest.

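import pandas as pd

# read_html returns a list of DataFrames; match='Rank' keeps only tables containing that text
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
columns = ['Nation','HDI']
# with a multi-level header, 'HDI' can match several sub-columns, so keep only the first two columns selected
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)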
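
To illustrate why the selection above is followed by .iloc[:, :2], here is a minimal offline sketch. It assumes the parsed table has a two-level header; the sub-column names and the rows are hypothetical placeholders, not real HDI data.

```python
import pandas as pd

# Hypothetical two-level header, loosely modelled on the Wikipedia table.
columns = pd.MultiIndex.from_tuples([
    ("Rank", "Current"),
    ("Rank", "Change over 5 years"),
    ("Nation", "Nation"),
    ("HDI", "Current"),
    ("HDI", "Average annual growth"),
])
# Placeholder rows, not real HDI data.
raw = pd.DataFrame(
    [[1, 0, "Country A", 0.95, "0.2%"],
     [2, 1, "Country B", 0.94, "0.3%"]],
    columns=columns,
)

wanted = ["Nation", "HDI"]
# Selecting by top-level label keeps every sub-column under it (three here).
subset = raw.loc[:, wanted]
# Keep only the first two of those columns, then flatten the header.
flat = subset.iloc[:, :2].copy()
flat.columns = wanted
print(flat)
```

If the real table has a different header layout, the number of columns to keep would need adjusting.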

As per the comments: I see little point in involving bs4 if you are still using pandas, but it can be done as follows:

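from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
soup = bs(r.content, 'lxml')
# :-soup-contains replaces the deprecated :contains selector (soupsieve 2.1+)
html_table = str(soup.select_one('table:has(th:-soup-contains("Rank"))'))
# wrap the markup in StringIO; newer pandas versions warn when given a literal HTML string
table = pd.read_html(StringIO(html_table))[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)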
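
If you would rather keep the row loop from the question, here is a rough sketch of one way to make it less sensitive to cell positions. It runs against a small inline table that is a hypothetical stand-in for the Wikipedia page (placeholder values), and it assumes that some row cells may be marked up as th rather than td, which would shift td-only indices. Whether that is what happens on the live page is not confirmed here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the Wikipedia table (placeholder values).
HTML = """
<table>
  <tr><th>Rank</th><th>Rank change</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>0</td><th>Country A</th><td>0.950</td></tr>
  <tr><td>2</td><td>1</td><th>Country B</th><td>0.940</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")

# Look up column positions from the header row instead of hard-coding them.
header = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
nation_idx = header.index("Nation")
hdi_idx = header.index("HDI")

records = []
for row in rows[1:]:
    # Include <th> cells too: a row-header cell would otherwise shift the indices.
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    if len(cells) <= max(nation_idx, hdi_idx):
        continue  # skip short or malformed rows
    records.append({"Countries": cells[nation_idx], "HDI_score": cells[hdi_idx]})

# Build the frame once; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(records, columns=["Countries", "HDI_score"])
print(df)
```

Collecting the rows in a list and constructing the DataFrame at the end also avoids copying the frame on every iteration. A real page with multi-row headers or rowspans would need more handling than this sketch shows.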