Tags: python, selenium-webdriver, web-scraping, python-requests, python-requests-html

Get the URL of a link using Python web scraping (requests, requests_html, selenium)


I'm new to web scraping, and I'm having trouble getting a link to data from a USGS earthquake's "Did You Feel It?" (DYFI) page. The URL I'm trying to get the data from is: https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity

I'm trying to automate the retrieval of this data so I don't have to fetch it manually after each earthquake. The URL for the data I'm trying to pull is consistent except for two parts: the earthquake's ID, which I have, and a number that doesn't seem to be tied to anything. Because I can't predict that number, I thought I could get the full URL with web scraping.
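As a rough illustration of that URL pattern, the sketch below splits a known archive URL into its parts. The regular expression and the names `ARCHIVE_URL_RE`, `parse_dyfi_url`, `source` and `code` are hypothetical and inferred only from the example URLs in this question. It does not explain what the number means; it only shows which pieces vary between events.

```python
import re

# Hypothetical pattern, inferred only from the example URLs in this question.
ARCHIVE_URL_RE = re.compile(
    r"^https://earthquake\.usgs\.gov/archive/product/dyfi/"
    r"(?P<event_id>[^/]+)/(?P<source>[^/]+)/(?P<code>\d+)/(?P<filename>[^/]+)$"
)


def parse_dyfi_url(url):
    """Return the variable parts of a DYFI archive URL as a dict, or None."""
    match = ARCHIVE_URL_RE.match(url)
    return match.groupdict() if match else None


example = (
    "https://earthquake.usgs.gov/archive/product/dyfi/"
    "us7000biji/us/1601214674370/dyfi_geo_10km.geojson"
)
parts = parse_dyfi_url(example)
print(parts)
```

The `event_id` part is known in advance, while the `code` part is the piece that has to be discovered from the page.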

If you look at the page, there is a drop-down menu called "Downloads" with different data products. I am trying to get the URL for the "DYFI Geospatial Data, UTM aggregated (10 km spacing)" product so I can pull the GeoJSON file using curl.

I don't know much about web scraping or HTML, and most of what I've tried is based on what I've found here and on YouTube.

What I've tried:

I tried using requests to get the HTML and parse it with Beautiful Soup, but the page is dynamically generated, so the HTML that came back didn't include what I was looking for.

import requests
import bs4 #beautiful soup

res = requests.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This outputs three links, but not the one I need:

<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and Web Services</a>
<a href="https://angular.io/guide/browser-support">view supported
            browsers</a>
<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and
            Web Services</a>

I think the USGS site uses JavaScript to populate the Downloads drop-down menu, which is why the plain requests approach didn't work, so I tried Selenium instead. I hoped it would give me the HTML that I can see with the browser's Inspect Element tool, but I didn't have any luck.

from selenium import webdriver
path = "/Users/jon/Desktop/selenium_webdriver/chromedriver" #path to chromedriver on my machine
driver = webdriver.Chrome(executable_path=path)
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
html_eq = driver.page_source
soup = bs4.BeautifulSoup(html_eq, 'html.parser')
for link in soup.find_all('a'):
    print(link) 

This outputs more links than my original attempt, but doesn't get me the link I'm looking for. Here is the output of my selenium attempt:

<a _ngcontent-fgi-c8="" class="hazdev-site-logo" href="/" title="U.S. Geological Survey"><img _ngcontent-fgi-c8="" alt="U.S. Geological Survey logo" src="assets/usgs-logo.svg"/></a>
<a _ngcontent-fgi-c8="" class="hazdev-jumplink-navigation" href="#site-sectionnav">Jump to Navigation</a>
<a _ngcontent-fgi-c5="" class="up-one-level ng-star-inserted" href="/earthquakes/map/" templatesidenavigation=""> Latest Earthquakes </a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/executive" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Overview </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Interactive Map </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/region-info" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Regional Information </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Impact </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/tellus" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Felt Report - Tell Us! </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted active-link" href="/earthquakes/eventpage/us7000bi0e/dyfi" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Did You Feel It? </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/technical" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Technical </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/origin" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Origin </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/waveforms" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Waveforms </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/feed/v1.0/detail/us7000bi0e.kml" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Download Event KML </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/map/#%7B%22autoUpdate%22%3Afalse%2C%22basemap%22%3A%22terrain%22%2C%22event%22%3A%22us7000bi0e%22%2C%22feed%22%3A%22us7000bi0e%22%2C%22mapposition%22%3A%5B%5B6.104279985601153%2C-85.06432001439885%5D%2C%5B10.603920014398849%2C-80.56467998560115%5D%5D%2C%22search%22%3A%7B%22id%22%3A%22us7000bi0e%22%2C%22isSearch%22%3Atrue%2C%22name%22%3A%22Search%20Results%22%2C%22params%22%3A%7B%22endtime%22%3A%222020-09-25T17%3A46%3A43.975Z%22%2C%22latitude%22%3A8.3541%2C%22longitude%22%3A-82.8145%2C%22maxradiuskm%22%3A250%2C%22minmagnitude%22%3A2%2C%22starttime%22%3A%222020-08-14T17%3A46%3A43.975Z%22%7D%7D%7D" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> View Nearby Seismicity </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Earthquakes </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/hazards/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Hazards </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/data/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Data &amp; Products </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/learn/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Learn </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/monitoring/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Monitoring </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/research/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Research </div></a>
<a _ngcontent-fgi-c18="" class="tell-us-link" href="/earthquakes/eventpage/us7000bi0e/tellus" queryparamshandling="preserve"> Felt Report - Tell Us! </a>
<a _ngcontent-fgi-c22=""> View all dyfi products (1 total) </a>
<a _ngcontent-fgi-c20="" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity"> US </a>
<a _ngcontent-fgi-c18="" aria-current="true" aria-disabled="false" class="mat-tab-link ng-star-inserted mat-tab-label-active" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/zip" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> ZIP Map </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity-vs-distance" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity Vs. Distance </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses-vs-time" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Responses Vs. Time </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> DYFI Responses </a>
<a _ngcontent-fgi-c28="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map?dyfi-responses-10km=true&amp;shakemap-intensity=false"><img _ngcontent-fgi-c28="" alt="DYFI intensity map" src="https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/us7000bi0e_ciim_geo.jpg"/></a>
<a _ngcontent-fgi-c23="" href="/earthquakes/eventpage/us7000bi0e">Overview</a>
<a _ngcontent-fgi-c32="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact"> Impact Summary </a>
<a _ngcontent-fgi-c18="" href="https://earthquake.usgs.gov/data/dyfi/">Scientific Background for Did You Feel It?</a>
<a href="https://earthquake.usgs.gov/data/comcat/contributor/us/">USGS National Earthquake Information Center, PDE</a>
<a _ngcontent-fgi-c7="" href="/data/comcat/"> ANSS Comprehensive Earthquake Catalog (ComCat) Documentation </a>
<a _ngcontent-fgi-c7="" href="/data/comcat/data-eventterms.php"> Technical terms used on event pages </a>
<a _ngcontent-fgi-c11="" href="mailto:lisa%2Behpweb@usgs.gov">Questions or comments?</a>
<a _ngcontent-fgi-c11="" class="facebook ng-star-inserted" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Facebook">Facebook</a>
<a _ngcontent-fgi-c11="" class="twitter ng-star-inserted" href="https://twitter.com/intent/tweet?url=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity&amp;text=USGS%20%7C%20M 5.3 - 1 km NNW of Manaca Norte, Panama" title="Share using Twitter">Twitter</a>
<a _ngcontent-fgi-c11="" class="email ng-star-inserted" href="mailto:lisa%2Behpweb@usgs.gov?to=&amp;subject=M 5.3 - 1 km NNW of Manaca Norte, Panama&amp;body=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Email">Email</a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/"> Home </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/aboutus/"> About Us </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/contactus/"> Contacts </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/legal.php"> Legal </a>

I found a YouTube tutorial about web scraping with requests_html that I thought might work: https://www.youtube.com/watch?v=MeBU-4Xs2RU. I can get the example from the video to work with the beer website, but I haven't been able to apply it to my situation.

Here is the code I've tried:

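from requests_html import HTMLSession

url_usgs = 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity'

s = HTMLSession()  # create the session that is used for the request below
r_usgs = s.get(url_usgs)

# execute the page's JavaScript before querying the HTML
r_usgs.html.render(sleep=1)

# select the header of the Downloads panel by its id and list the links inside it
downloads = r_usgs.html.xpath('//*[@id="mat-expansion-panel-header-0"]', first=True)
print(downloads.absolute_links)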

This isn't returning anything, though. I don't know HTML, so it's possible that I'm using the XPath of the wrong element.

If anyone has any ideas on how I can get the URL for the 10 km DYFI data from the Downloads menu (https://earthquake.usgs.gov/archive/product/dyfi/us7000biji/us/1601214674370/dyfi_geo_10km.geojson), or could point me toward some more in-depth material on web scraping, I would appreciate it.


Solution

  • You need to click on the "Downloads" menu in order to expand the content.

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    import time
    
    
    driver = webdriver.Chrome()
    driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
    
    # Get a reference to the download menu. This can run before the page has
    # finished loading, so we keep retrying in a loop until the element exists.
    # (find_element(By...) is the Selenium 4 locator API; the older
    # find_element_by_* helpers have been removed.)
    while True:
        try:
            download_menu = driver.find_element(By.ID, 'mat-expansion-panel-header-0')
        except NoSuchElementException:
            time.sleep(0.2)
            continue
        else:
            break
    
    # Click on the download menu to expand its content.
    download_menu.click()
    
    # Wait in the same way for the expanded panel to appear.
    while True:
        try:
            downloads = driver.find_element(By.ID, 'cdk-accordion-child-0')
        except NoSuchElementException:
            time.sleep(0.2)
            continue
        else:
            break
    
    # Collect every link in the panel and keep the ones labelled as GeoJSON.
    links = downloads.find_elements(By.CSS_SELECTOR, 'a')
    geojson = [link for link in links if 'geojson' in link.text.lower()]
    
    for link in geojson:
        print(link.text, ':', link.get_attribute('href'))
    
    
    # quit() ends the whole browser session, not just the current window.
    driver.quit()
    

    Which will produce:

    GEOJSON 645.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_zip.geojson
    GEOJSON 844.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_1km.geojson
    GEOJSON 1.0 KB : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_10km.geojson
    

    ...and of course you could inspect the value of the href attributes to find the 10km data (by looking for the one that contains 10km in the link).
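
A minimal sketch of that last step, assuming the href values have already been collected into a plain list of strings (here copied from the output above). The helper name `pick_10km` is hypothetical and not part of Selenium.

```python
def pick_10km(hrefs):
    """Return the first href whose file name mentions '10km', or None."""
    for href in hrefs:
        # Check only the final path segment, so the match cannot come
        # from some other part of the URL.
        filename = href.rsplit("/", 1)[-1]
        if "10km" in filename:
            return href
    return None


hrefs = [
    "https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_zip.geojson",
    "https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_1km.geojson",
    "https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_10km.geojson",
]
url_10km = pick_10km(hrefs)
print(url_10km)
```

With the Selenium code above, the list could be built as `[link.get_attribute('href') for link in geojson]`, and the resulting URL can then be passed to curl.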