
Web Scraping innerHTML


I am trying to web scrape https://etherscan.io/ with the BeautifulSoup (Python 3) library for an open-source project. Specifically, I want to grab a row's txn address whose "To" column reads "Contract Creation" (i.e., the inner HTML).

Take, for example, this line found with Firefox's Inspect Element feature:

<a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234" title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a>

Here is some code:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://etherscan.io/txs?block=8086187'
    response = requests.get(url, timeout=5)
    content = BeautifulSoup(response.content, "html.parser")
    page = content.findAll('td', attrs={"span": ""})
    page = ''.join(str(page))
    if page.find("Contract Creation") != -1:
        # find the txn that matches the contract creation
        for i in range(len(page)):
            if i + 1 != len(page):
                # pseudocode: if the line at the current position == "Contract Creation",
                #     append the txn address to tx and break

For this page, the expected output should be:

0x48a97150373ca517723db6c39eebcda34719e73a9adb975d5912f21c5a9b4971

I am having trouble pulling out the specific information. For now, I am just checking that the page contains a contract creation and then trying to locate it. I could hardcode a character-by-character check such as

    if page[i] == "c" and page[i+1] == "o" and page[i+2] == "n"...:
        txn.append(page[i-someNumber:i-anotherNumber])

but this isn't efficient.

Even better would be to grab just the contract addresses, which are located in the title attribute. If I can grab that specific <a href> line, then I could feasibly extract the contract address: 0x65A0cDb8e79Ae3e0c54436362206fd0769335234
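Once that anchor is located, its title attribute can be read directly. A minimal sketch using only the snippet from the question (no network request needed):

```python
from bs4 import BeautifulSoup

# The anchor markup copied from the inspected element above
html = ('<a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234" '
        'title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a>')

soup = BeautifulSoup(html, 'html.parser')
# Locate the anchor by its visible text, then read the title attribute
link = soup.find('a', string='Contract Creation')
contract_address = link['title']
print(contract_address)  # 0x65a0cdb8e79ae3e0c54436362206fd0769335234
```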


Solution

  • With bs4 4.7.1+ you can use nth-of-type and :contains to search the 6th column for that string, then :has to get the parent row, and nth-of-type again to get the first column value of that row, i.e. the txn hash. The URL takes query-string parameters, so you can pull back more results per request, and a Session lets you re-use the connection for efficiency.

    The idea is to show the components and a framework for matching and extracting. You could, for example, loop over a list of URLs instead.


    CSS selectors:

    [Diagram explaining the selector combination]


    Python3:

    from bs4 import BeautifulSoup as bs
    import requests

    results = []

    with requests.Session() as s:  # re-use the TCP connection across requests
        for page in range(1, 10):
            # ps=51 sets the page size; p selects the results page
            r = s.get('https://etherscan.io/txs?ps=51&p={}'.format(page))
            soup = bs(r.content, 'lxml')
            # rows whose 6th column contains "Contract Creation" -> take the 1st column (the txn hash)
            txns_current = [item.text for item in soup.select('tr:has(td:nth-of-type(6):contains("Contract Creation")) td:nth-of-type(1)')]
            if txns_current:
                results.append(txns_current)

    # flatten the per-page lists into a single list of txn hashes
    final = [item for i in results for item in i]

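The same row-level selector can be extended to also capture the contract address (the "even better" case in the question): take each matching row with :has, read the first cell's text for the txn hash, and read the title attribute of the anchor in the sixth cell. A sketch against a static, simplified stand-in for the table markup (the real etherscan row layout may differ); note that newer soupsieve releases, the selector engine behind bs4, spell :contains as :-soup-contains:

```python
from bs4 import BeautifulSoup

# Simplified, assumed stand-in for two rows of the etherscan txs table
html = """
<table><tbody>
<tr>
  <td>0x48a97150373ca517723db6c39eebcda34719e73a9adb975d5912f21c5a9b4971</td>
  <td>8086187</td><td>1 min ago</td><td>0xabc</td><td>IN</td>
  <td><a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234"
         title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a></td>
</tr>
<tr>
  <td>0xdeadbeef</td><td>8086187</td><td>2 min ago</td>
  <td>0xdef</td><td>OUT</td><td>0x123 (an ordinary transfer)</td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
# keep only rows whose 6th column mentions "Contract Creation"
rows = soup.select('tr:has(td:nth-of-type(6):-soup-contains("Contract Creation"))')
# pair each txn hash (column 1) with the contract address (title of the anchor in column 6)
pairs = [(row.select_one('td:nth-of-type(1)').text,
          row.select_one('td:nth-of-type(6) a')['title']) for row in rows]
print(pairs)
```

Only the first row matches, so pairs holds one (txn hash, contract address) tuple.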

    Additional reading:

    CSS selectors are covered extensively in the SoupSieve documentation, the selector engine used by bs4 4.7+.

    Note: support for :contains and :has requires bs4 4.7.1; nth-of-type is widely supported.