Search code examples
javascriptpythonblockchainpython-requests-html

Webscraping Blockchain data seemingly embedded in Javascript through Python, is this even the right approach?


I'm referencing this url: https://tracker.icon.foundation/block/29562412

If you scroll down to "Transactions", it shows 2 transactions with separate links, that's essentially what I'm trying to grab. If I try a simple pd.read_csv(url) command, it clearly omits the data I'm looking for, so I thought it might be JavaScript based and tried the following code instead:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://tracker.icon.foundation/block/29562412')
r.html.links
r.html.absolute_links

and I get the result "set()" even though I was expecting the following:

['https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632', 'https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e']

Is JavaScript even the right approach? I tried BeautifulSoup instead and found no cigar on that end as well.


Solution

  • You're right. This page is populated asynchronously using JavaScript, so BeautifulSoup and similar tools won't be able to see the specific content you're trying to scrape.

    However, if you log your browser's network traffic, you can see some (XHR) HTTP GET requests being made to a REST API, which serves its results in JSON. This JSON happens to contain the information you're looking for. It actually makes several such requests to various API endpoints, but the one we're interested in is called txList (short for "transaction list" I'm guessing):

    def main():
    
        import requests
    
        url = "https://tracker.icon.foundation/v3/block/txList"
    
        params = {
            "height": "29562412",
            "page": "1",
            "count": "10"
        }
    
        response = requests.get(url, params=params)
        response.raise_for_status()
    
        base_url = "https://tracker.icon.foundation/transaction/"
    
        for transaction in response.json()["data"]:
            print(base_url + transaction["txHash"])
    
        return 0
    
    
    if __name__ == "__main__":
        import sys
        sys.exit(main())
    

    Output:

    https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632
    https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e
    >>>