python html excel beautifulsoup urllib

I need help extracting an embedded .xlsx link from a webpage using Python/BeautifulSoup


I'm trying to access an Excel table from this website and load it as a DataFrame. Here is what I have:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://tedb.ornl.gov/data/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Select all 'a' elements with href attributes containing URLs starting with https://
for link in soup.select('a[href^="https://"]'):
    href = link.get('href')
    print(href)

I'd like to grab Table 4.01, whose link, when inspected, is contained within the HTML element:

<a href="https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx">xlsx</a>

However, when I run my code, all I get back are the links below:

https://www.ornl.gov
https://tedb.ornl.gov/
https://tedb.ornl.gov/data/
https://tedb.ornl.gov/archive/
https://tedb.ornl.gov/citation/
https://tedb.ornl.gov/contact/
https://tedb.ornl.gov/wp-content/uploads/2020/02/TEDB_Ed_38.pdf
https://tedb.ornl.gov/wp-content/uploads/2020/08/TEDB_38.2_Spreadsheets_08312020.zip
https://tedb.ornl.gov/wp-content/uploads/2020/08/Updates_08312020.pdf
https://www.ornl.gov/ornl/contact-us/Security--Privacy-Notice
https://www.ornl.gov/content/accessibility
https://www.ornl.gov/content/notice-nondiscrimination-and-accessibility-requirements

Does anyone know why the Excel link I'm looking for does not show up?


Solution

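  • The table is generated dynamically after the page loads, so the static HTML that urlopen downloads never contains the .xlsx links. The page fetches its rows from a back-end URL, though, and you can query that URL directly.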

    Here's how:

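    import requests
    from bs4 import BeautifulSoup
    
    # Back-end endpoint the page calls to load the table rows (returns JSON)
    url = "https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort"
    
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    rows = response.json()
    
    # Each row's "excel" field is an HTML fragment holding the download link
    for item in rows:
        cell = BeautifulSoup(item["value"]["excel"], "html.parser")
        print(cell.find("a")["href"])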
    

    Output:

    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_01_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_02_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_03_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_04_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_01_08312020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_02_08312020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_03_08312020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/08/Table1_05_08312020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_06_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_07_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_08_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_04_08312020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_09_04302020.xlsx
    https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_10_04302020.xlsx
    and many more...
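
Since the question asks for Table 4.01 as a DataFrame, here is a minimal sketch of the last step, assuming you have collected the printed links into a list. The helper names `pick_table_url` and `load_table` are made up for illustration, and the prefix `Table4_01_` is taken from the file name shown in the question. Whether the spreadsheet needs extra arguments such as `skiprows` to skip title rows is not something the output above tells us, so treat the plain `read_excel` call as a starting point.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pick_table_url(urls, table_prefix):
    """Return the first .xlsx URL whose file name starts with table_prefix, else None.

    Hypothetical helper: matches on the file name only, e.g. "Table4_01_".
    """
    for candidate in urls:
        filename = PurePosixPath(urlparse(candidate).path).name
        if filename.startswith(table_prefix) and filename.endswith(".xlsx"):
            return candidate
    return None


def load_table(xlsx_url):
    """Download the spreadsheet at xlsx_url and return it as a DataFrame.

    Hypothetical helper: pandas is imported here so that pick_table_url
    stays usable without third-party packages. Reading .xlsx files with
    pandas also requires the openpyxl package.
    """
    import pandas as pd

    return pd.read_excel(xlsx_url)
```

Usage would look something like `url = pick_table_url(links, "Table4_01_")` followed by `df = load_table(url)` when `url` is not `None`, where `links` is the list of hrefs gathered in the loop above.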