Tags: python, pandas, beautifulsoup, html-parsing

Failed to extract html table data using Beautiful Soup in Python


I am trying to replicate this code to make some graphs, but I could not produce the CSV file. I ran the exact same code, but to no avail: it prints an empty DataFrame.

The code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import geopandas as gpd
from prettytable import PrettyTable

url = 'https://www.mohfw.gov.in/'
# make a GET request to fetch the raw HTML content
web_content = requests.get(url).content

# parse the html content
soup = BeautifulSoup(web_content, "html.parser")

# remove any newlines and extra spaces from left and right
extract_contents = lambda row: [x.text.replace('\n', '').strip() for x in row]

# find all table rows and data cells within
stats = []
all_rows = soup.find_all('tr')
for row in all_rows:
    stat = extract_contents(row.find_all('td'))
    # notice that the data that we require is now a list of length 5
    if len(stat) == 5:
        stats.append(stat)

# now convert the data into a pandas dataframe for further processing
new_cols = ["Sr.No", "States/UT", "Confirmed", "Recovered", "Deceased"]
state_data = pd.DataFrame(data=stats, columns=new_cols)
state_data.head()

Any help is appreciated.


Solution

  • You can get all of that data from their URI, which returns JSON. You will need to map some column names and then do calculations on the returned columns to derive the changes since yesterday. Columns prefixed with new_ hold today's values.

    import pandas as pd
    import requests
    
    # fetch the same data as JSON instead of scraping the HTML table
    r = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
    df = pd.DataFrame(r)
    df
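
The column mapping and the "change since yesterday" calculation mentioned above could look roughly like the sketch below. It is only an illustration: the field names (sno, state_name, positive, cured, death and their new_ counterparts) are assumptions about the JSON, and the two records are made-up placeholders so the snippet runs without network access. Check df.columns on the real response and adjust the names before relying on it.

```python
import pandas as pd

# Made-up placeholder records. The field names are ASSUMPTIONS about the
# JSON layout; inspect df.columns on the real response to confirm them.
sample = [
    {"sno": "1", "state_name": "State A",
     "positive": "100", "cured": "80", "death": "5",
     "new_positive": "110", "new_cured": "85", "new_death": "6"},
    {"sno": "2", "state_name": "State B",
     "positive": "50", "cured": "40", "death": "2",
     "new_positive": "50", "new_cured": "45", "new_death": "2"},
]
df = pd.DataFrame(sample)

# Map the (assumed) JSON field names to the column labels from the question.
pairs = {"Confirmed": "positive", "Recovered": "cured", "Deceased": "death"}

state_data = pd.DataFrame({"Sr.No": df["sno"], "States/UT": df["state_name"]})
for label, col in pairs.items():
    # Values may arrive as strings, so convert them before doing arithmetic.
    today = pd.to_numeric(df["new_" + col], errors="coerce")
    yesterday = pd.to_numeric(df[col], errors="coerce")
    state_data[label] = today
    state_data[label + " change"] = today - yesterday

print(state_data)
```

To use it on live data, replace the sample list with the df built from the request above. errors="coerce" turns any non-numeric cell into NaN instead of raising, which makes it easier to spot rows that need cleaning.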