
Python Requests-HTML - Can't find specific data


I am trying to scrape a web page using the Python requests-html library.
The page is https://www.koyfin.com/charts/g/USADebt2GDP?view=table , and the image below shows (circled in red) the data I want to get.

[Image: screenshot of the page with the desired data circled in red]

My code is:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.koyfin.com/charts/g/USADebt2GDP?view=table')
r.html.render(timeout=60)
print(r.text)

The page HTML looks like this:

[Image: screenshot of the page HTML]

The problem is that when I scrape the page, I can't find the data I want in the output, even though in the page's HTML I can see it inside the first div tags in the body section. Are there any specific suggestions for how to solve this?

Thanks.


Solution

  • The problem is that the data is loaded by JavaScript after the initial page load. One solution is to use Selenium to drive a web browser and scrape the rendered page. However, when I looked at the network requests made by a regular browser, it appeared that the data you are after is loaded by the following AJAX call:

    https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly
    

    So:

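    import requests

    # Call the JSON endpoint directly instead of rendering the page.
    response = requests.get('https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly')
    results = response.json()
    print(results)
    # Each entry in results['graph']['data'] is a [date, value] pair.
    for t in results['graph']['data']:
        print(t)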
    

    Prints:

    {'ticker': 'USADebt2GDP', 'companyName': 'United States Gross Federal Debt to GDP', 'startDate': '1940-12-31T00:00:00.000Z', 'endDate': '2019-12-31T00:00:00.000Z', 'unit': 'percent', 'graph': {'column_names': ['Date', 'Volume'], 'data': [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1], ['2013-12-31', 101.2], ['2014-12-31', 103.2], ['2015-12-31', 100.8], ['2016-12-31', 105.8], ['2017-12-31', 105.4], ['2018-12-31', 106.1], ['2019-12-31', 106.9]]}, 'withoutLiveData': True}
    ['2010-12-31', 91.4]
    ['2011-12-31', 96]
    ['2012-12-31', 100.1]
    ['2013-12-31', 101.2]
    ['2014-12-31', 103.2]
    ['2015-12-31', 100.8]
    ['2016-12-31', 105.8]
    ['2017-12-31', 105.4]
    ['2018-12-31', 106.1]
    ['2019-12-31', 106.9]
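
If you need a different date range, the request above can be built from its parts rather than hard-coded. The following is only a sketch: `build_url` and `parse_rows` are hypothetical helper names, the endpoint and parameter names are copied from the URL above, and I have not checked which other `period` values the API accepts. It runs against a trimmed copy of the printed response, so no network call is made.

```python
from datetime import date
from urllib.parse import urlencode

# Endpoint copied from the answer above; the helper names below are illustrative only.
BASE_URL = "https://api.koyfin.com/api/v2/commands/g/g.gec/"


def build_url(ticker, date_from, date_to, period="yearly"):
    """Assemble the request URL from its parts instead of hard-coding it."""
    query = urlencode({"dateFrom": date_from, "dateTo": date_to, "period": period})
    return f"{BASE_URL}{ticker}?{query}"


def parse_rows(payload):
    """Turn the ['YYYY-MM-DD', value] pairs into (date, float) tuples."""
    return [(date.fromisoformat(day), float(value))
            for day, value in payload["graph"]["data"]]


# Trimmed copy of the response printed above, used here in place of a live request.
sample = {
    "ticker": "USADebt2GDP",
    "graph": {
        "column_names": ["Date", "Volume"],
        "data": [["2010-12-31", 91.4], ["2011-12-31", 96], ["2012-12-31", 100.1]],
    },
}

url = build_url("USADebt2GDP", "2010-08-20", "2020-09-05")
rows = parse_rows(sample)
print(url)
for day, value in rows:
    print(day.isoformat(), value)
```

With a live request you would pass `requests.get(url).json()` to `parse_rows` in place of `sample`.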
    

    How I Came Up with the URL

    [Image: screenshot of the browser's network requests]

    And when you click on the last message:

    [Image: screenshot of the details shown after clicking the last message]