Scrape JSON response with Selenium Browser


I want to automate the following steps:

  • Open a web page in Google Chrome.
  • Wait for it to render all the needed information.
  • Go to Inspect Element, open the Network tab, and look at the XHR requests.
  • Find the request that I need.
  • Copy the content of its response (to save it in a txt file).

It's a kind of web scraping, but with less effort (I think).

The problem is that I can't find which tools allow me to do that.

I started with Python and Selenium (ChromeDriver), but I couldn't find any information on whether it is possible to get XHR responses. All the tutorials are about scraping HTML. It seems logical that it would be possible, but my research didn't help.

Any idea?

Thank you.


Solution

  • The website you are trying to scrape generates its content dynamically with JavaScript.

    You have two options to work around that:

    1. Simulate human interaction with a browser using Selenium: open the website, wait until all the content is rendered, and then use Selenium to extract the data you want. This approach deals with the Elements tab; you just use CSS or XPath selectors to get the tags you want.

    2. Instead of finding a way to make Selenium go to the Network tab and save the content (which you will find extremely hard to do), get the URL of the XHR request, build the same request with the same headers and parameters (if any), and send it with requests. You can then save the content easily.
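
The second option can be sketched roughly as follows. This is only an illustration under stated assumptions: `API_URL`, `PARAMS` and `HEADERS` are hypothetical placeholders, and the helper names `build_xhr_request` and `fetch_json` are made up for this example. The real values have to be copied from the XHR entry in the browser's Network tab.

```python
import requests

# Hypothetical values for illustration: copy the real URL, query parameters
# and headers from the XHR entry in the browser's Network tab.
API_URL = "https://example.com/api/items"
PARAMS = {"take": 15, "filter": 1}
HEADERS = {
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}


def build_xhr_request(url, params=None, headers=None):
    """Prepare, without sending, a GET request that mirrors the browser's XHR."""
    return requests.Request("GET", url, params=params, headers=headers).prepare()


def fetch_json(prepared, timeout=10):
    """Send the prepared request and return the decoded JSON body."""
    with requests.Session() as session:
        response = session.send(prepared, timeout=timeout)
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        return response.json()


if __name__ == "__main__":
    prepared = build_xhr_request(API_URL, PARAMS, HEADERS)
    print(prepared.url)  # inspect the final URL before sending anything
    # data = fetch_json(prepared)  # enable once the placeholders are replaced
```

Preparing the request separately from sending it makes it easy to compare the final URL and headers with what the browser sent before any traffic goes out.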

    Let's try to scrape Home | Microsoft Academic

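    First approach:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # Launch the browser
    driver.get("https://academic.microsoft.com/home")  # Go to the given URL
    # Wait up to 10 seconds for the JavaScript-rendered author links, then collect them
    locator = (By.XPATH, '//a[@data-appinsights-action="TopAuthorSelected"]')
    authors = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(locator))
    for author in authors:  # Loop through them
        print(author.text)
    driver.quit()  # Close the browser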
    

    Output:

    1. Yoshua Bengio
    2. Geoffrey E. Hinton
    3. Andrew Zisserman
    4. Ilya Sutskever
    5. Jian Sun
    6. Trevor Darrell
    7. Scott Shenker
    8. Jiawei Han
    9. Kaiming He
    10. Ross Girshick
    11. Ion Stoica
    12. Hari Balakrishnan
    13. R Core Team
    14. Jitendra Malik
    15. Jeffrey Dean
    

    Second approach:

    import requests
    res = requests.get('https://academic.microsoft.com/api/analytics/authors/topauthors?topicId=41008148&take=15&filter=1&dateRange=1').json()
    # The XHR response is usually in JSON format, so .json() decodes it into Python objects
    #res = [{'name': 'Yoshua Bengio', 'id': '161269817', 'lat': 0.0, 'lon': 0.0}, {'name': 'Geoffrey E. Hinton', 'id': '563069026', 'lat': 0.0, 'lon': 0.0}, {'name': 'Andrew Zisserman', 'id': '2469405535', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ilya Sutskever', 'id': '215131072', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jian Sun', 'id': '2200192130', 'lat': 0.0, 'lon': 0.0}, {'name': 'Trevor Darrell', 'id': '2174985400', 'lat': 0.0, 'lon': 0.0}, {'name': 'Scott Shenker', 'id': '719828399', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jiawei Han', 'id': '2121939561', 'lat': 0.0, 'lon': 0.0}, {'name': 'Kaiming He', 'id': '2164292938', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ross Girshick', 'id': '2473549963', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ion Stoica', 'id': '2161479384', 'lat': 0.0, 'lon': 0.0}, {'name': 'Hari Balakrishnan', 'id': '1998464616', 'lat': 0.0, 'lon': 0.0}, {'name': 'R Core Team', 'id': '2976715238', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jitendra Malik', 'id': '2136556746', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jeffrey Dean', 'id': '2429370538', 'lat': 0.0, 'lon': 0.0}]
    for author in res:
        print(author['name'])
    

    Output:

    Yoshua Bengio
    Geoffrey E. Hinton
    Andrew Zisserman
    Ilya Sutskever
    Jian Sun
    Trevor Darrell
    Scott Shenker
    Jiawei Han
    Kaiming He
    Ross Girshick
    Ion Stoica
    Hari Balakrishnan
    R Core Team
    Jitendra Malik
    Jeffrey Dean
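
To cover the last step from the question (saving the response in a txt file), a minimal sketch could look like this. The helper name `save_json_as_text` and the file name used below are assumptions made for illustration, not part of the original answer.

```python
import json
from pathlib import Path


def save_json_as_text(data, path):
    """Write already-decoded JSON data to a UTF-8 text file and return its path."""
    path = Path(path)
    # indent=2 keeps the file readable; ensure_ascii=False preserves non-ASCII names
    text = json.dumps(data, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
    return path
```

With the `res` list from the second approach, calling `save_json_as_text(res, "top_authors.txt")` would write the whole response to a text file in the current directory.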
    

    The second approach saves time and resources, and it is more straightforward.

    [Image: using the first approach]

    [Image: using the second approach]