python pandas html-table html-parsing

Parsing one table out of several tables contained in an HTML page using Python


I am trying to parse a table from the HTML page at this link, but I have not found a way to reliably select the right table, since the page contains several other tables as well, as shown in the attached image.

I tried the simpler approach of calling pandas.read_html and letting it figure things out, but this only returns (I am guessing) the content at the top of the page and misses everything else.

import pandas as pd
url='https://www.360optimi.com/app/sec/resourceType/benchmarkGraph?resourceSubTypeId=5c9316b28e202b46c92ca518&resourceId=envdecAluminumWindowProfAl&profileId=Saray2016&benchmarkToShow=co2_cml&entityId=5e4eae0f619e783ceb5d0732&indicatorId=lcaForLevels-CO2&stateIdOfProject='
tables = pd.read_html(url)
print(tables[0])

which returns:

            0         1         2
0     English  Français   Deutsch
1     Español     Suomi     Norsk
2  Nederlands   Svenska  Italiano

Any idea on how I can use the right html tags to point to the table of interest?

EDIT: As some of you noted, login credentials are required for the web page (apologies), so I have uploaded the HTML code here.

screenshot of web page with code inspected


Solution

  • I used the HTML that you provided as input. To use this code on a URL, first fetch that URL's HTML and pass it in as a string.

    from io import StringIO

    from bs4 import BeautifulSoup
    import pandas as pd

    # html_code_of_your_url is a placeholder: the HTML of the page, as a string
    Your_input_html_string = str(html_code_of_your_url)

    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(Your_input_html_string, "html.parser")

    # The table to extract has the id "resourceBenchmarkTable", so pull that table's HTML out of the full page
    extracted_table_html = str(soup.find("table", id="resourceBenchmarkTable"))

    # pd.read_html returns a list of DataFrames; the extracted HTML holds a single table, so take the first
    table_dataframe = pd.read_html(StringIO(extracted_table_html))[0]

    print(table_dataframe)
    

    Output (only the first 5 rows, to keep the answer short; shown as an image in the original answer)
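
  • As an alternative sketch, pandas.read_html can filter tables on its own through its attrs parameter, which skips the BeautifulSoup step. The inline HTML below is a made-up stand-in for the real page (only the id "resourceBenchmarkTable" comes from the page), so the column names and values are assumptions:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the saved page: two tables, only one carries the target id.
html = """
<table><tr><td>English</td><td>Deutsch</td></tr></table>
<table id="resourceBenchmarkTable">
  <tr><th>Resource</th><th>CO2</th></tr>
  <tr><td>A</td><td>1.5</td></tr>
  <tr><td>B</td><td>2.0</td></tr>
</table>
"""

# attrs matches against attributes of the <table> tag itself.
# read_html still returns a list, here holding only the matching table.
tables = pd.read_html(StringIO(html), attrs={"id": "resourceBenchmarkTable"})
benchmark = tables[0]

print(benchmark)
```

    Note that pd.read_html(url) in the question already returns every table it finds as a list; tables[0] is simply the first one, which is why only the language menu was printed.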