
How to build an Etherscan web scraper?


I'm building a web scraper that refreshes a bunch of Etherscan URLs every 30 seconds. If any new transfers have happened that are not yet accounted for, it sends me an email notification with a link to the relevant address on Etherscan so I can check them manually.

One of the addresses that I wanted to keep tabs on is here:

https://etherscan.io/token/0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd?a=0xd071f6e384cf271282fc37eb40456332307bb8af

What I have done so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

url = 'https://etherscan.io/token/0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd?a=0x94f52b6520804eced0accad7ccb93c73523af089'
# I got this line from another post, since "uClient = uReq(URL)" and "page_html = uClient.read()" would not work (I believe Etherscan is attempting to block web scraping or something?)
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
# Open the URL once; the with-block closes the connection afterwards
with urlopen(req, timeout=20) as resp:
    response = resp.read()
page_soup = soup(response, "html.parser")
Transfers_info_table_1 = page_soup.find("div", {"class": "table-responsive"})
print(Transfers_info_table_1)

The interesting thing is, when I run this, I get the following output:

<div class="table-responsive" style="visibility:hidden;">
<iframe frameborder="0" id="tokentxnsiframe" scrolling="no" src="" style="width: 100px; height: 600px; min-width: 100%;"></iframe>
</div>

I was expecting to get the output for the whole table of transfers. What am I doing wrong here?


Solution

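  • The table is inside an iframe, so it is not part of the page you requested. Copy the src value of the iframe and request the content of that URL instead. In the raw HTML shown in the question the src is empty, so take the value from the browser's developer tools after the page has loaded.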

    from io import StringIO
    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup as soup
    import pandas as pd

    # URL taken from the iframe's src attribute
    url = 'https://etherscan.io/token/generic-tokentxns2?m=normal&contractAddress=0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd&a=0xd071f6e384cf271282fc37eb40456332307bb8af'
    # Send a browser-like User-Agent, since the request did not work without one
    req = Request(url, headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'})
    # Fetch the page once; the with-block closes the connection afterwards
    with urlopen(req, timeout=20) as resp:
        response = resp.read()
    page_soup = soup(response, "html.parser")
    Transfers_info_table_1 = page_soup.find("table", {"class": "table table-md-text-normal table-hover mb-4"})
    # Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
    df = pd.read_html(StringIO(str(Transfers_info_table_1)))[0]
    df.to_csv("TransferTable.csv", index=False)
    

    This writes the transfers table to TransferTable.csv.
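Since the question mentions watching several addresses, the iframe URL can be built from its parts rather than copied by hand each time. This is a minimal sketch, assuming the path and the query parameters (`m`, `contractAddress`, `a`) keep the shape of the URL used above; `build_transfers_url` is a hypothetical helper name, and Etherscan may change this URL format at any time.

```python
from urllib.parse import urlencode

# Assumption: path and parameter names are copied from the iframe URL in the
# answer above; they are not a documented Etherscan interface.
IFRAME_BASE = "https://etherscan.io/token/generic-tokentxns2"


def build_transfers_url(contract_address: str, holder_address: str) -> str:
    """Return the iframe URL listing transfers of one token for one address."""
    query = urlencode({
        "m": "normal",
        "contractAddress": contract_address,
        "a": holder_address,
    })
    return f"{IFRAME_BASE}?{query}"


url = build_transfers_url(
    "0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd",
    "0xd071f6e384cf271282fc37eb40456332307bb8af",
)
print(url)
```

For the two addresses shown, this reproduces the URL used in the solution code.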
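For the "notify me about new transfers every 30 seconds" part of the question, one possible approach is to remember the transaction hashes already seen and report only the ones that appear later. The sketch below is not taken from the answer. `fetch_hashes` and `notify` are hypothetical callables you would supply yourself: the first returns the hashes from the scraped table (for example, one column of the DataFrame above), the second sends the email.

```python
import time
from typing import Callable, Iterable, List, Optional, Set


def find_new_hashes(seen: Set[str], current: Iterable[str]) -> List[str]:
    """Return hashes from `current` that are not in `seen`, and record them."""
    new = []
    for tx_hash in current:
        if tx_hash not in seen:
            seen.add(tx_hash)
            new.append(tx_hash)
    return new


def watch(
    fetch_hashes: Callable[[], Iterable[str]],  # hypothetical: scrapes the table
    notify: Callable[[List[str]], None],        # hypothetical: sends the email
    interval: float = 30.0,
    max_polls: Optional[int] = None,            # None means poll forever
) -> None:
    """Poll `fetch_hashes` and call `notify` with any hashes not seen before."""
    # Baseline: transfers that already exist at startup are not reported.
    seen = set(fetch_hashes())
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        new = find_new_hashes(seen, fetch_hashes())
        if new:
            notify(new)
        polls += 1


# Small demonstration of the comparison step with made-up hashes.
seen_so_far = {"0xaaa", "0xbbb"}
print(find_new_hashes(seen_so_far, ["0xccc", "0xaaa", "0xbbb"]))  # ['0xccc']
```

Keeping the comparison in its own function makes it easy to check without touching the network. The 30-second interval is the one from the question.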