
Web Scraping innerHTML


I am trying to web scrape https://etherscan.io/ with the BeautifulSoup (Python 3) library for an open-source project. Specifically, I want to grab a row's txn address whose "To" column reads "Contract Creation" (i.e., the inner HTML).

Take, for example, this line found with Firefox's Inspect Element feature:

<a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234" title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a>

Here is some code:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://etherscan.io/txs?block=8086187'
    response = requests.get(url, timeout=5)
    content = BeautifulSoup(response.content, "html.parser")
    page = content.findAll('td', attrs={"span": ""})
    page = ''.join(str(page))
    if page.find("Contract Creation") != -1:
        # find the txn that matches the contract creation
        for i in range(len(page)):
            if i + 1 != len(page):
                # pseudocode: if the line at the current position == "Contract Creation",
                #     append the txn address to tx and break

For this page, the expected output should be:

0x48a97150373ca517723db6c39eebcda34719e73a9adb975d5912f21c5a9b4971

I am having trouble pulling out the specific information. For now, I am just checking that the page contains a contract creation and then trying to locate it. I could hardcode a character-by-character check such as

    if page[i] == "c" and page[i+1] == "o" and page[i+2] == "n"...:
        txn.append(page[i-someNumber:i-anotherNumber])

but this isn't efficient.

Even better would be to grab just the contract addresses, which are located in the title attribute. If I can grab that specific <a href> line, then I could feasibly extract the contract address: 0x65A0cDb8e79Ae3e0c54436362206fd0769335234
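Once that anchor is located, its title attribute can be read directly. A minimal sketch using only the snippet from the question (no network request needed):

```python
from bs4 import BeautifulSoup

# The anchor markup copied from the inspected element above
html = ('<a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234" '
        'title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a>')

soup = BeautifulSoup(html, 'html.parser')
# Locate the anchor by its visible text, then read the title attribute
link = soup.find('a', string='Contract Creation')
contract_address = link['title']
print(contract_address)  # 0x65a0cdb8e79ae3e0c54436362206fd0769335234
```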


Solution

  • With bs4 4.7.1+ you can use nth-of-type and :contains to search the 6th column for that string, then :has to get the parent row, and nth-of-type again to get the first column value of that row, i.e. the txn hash. The URL takes query-string parameters, so you can pull back more results per request, and a Session lets you re-use the connection for efficiency.

    The idea is to show the components and a framework for matching and extracting. You could, for example, loop over a list of URLs instead.


    CSS selectors:

    [Diagram explaining the selector combination]


    Python3:

    from bs4 import BeautifulSoup as bs
    import requests

    results = []

    with requests.Session() as s:  # re-use the TCP connection across requests
        for page in range(1, 10):
            # ps=51 sets the page size; p selects the results page
            r = s.get('https://etherscan.io/txs?ps=51&p={}'.format(page))
            soup = bs(r.content, 'lxml')
            # rows whose 6th column contains "Contract Creation" -> take the 1st column (the txn hash)
            txns_current = [item.text for item in soup.select('tr:has(td:nth-of-type(6):contains("Contract Creation")) td:nth-of-type(1)')]
            if txns_current:
                results.append(txns_current)

    # flatten the per-page lists into a single list of txn hashes
    final = [item for i in results for item in i]

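The same row-level selector can be extended to also capture the contract address (the "even better" case in the question): take each matching row with :has, read the first cell's text for the txn hash, and read the title attribute of the anchor in the sixth cell. A sketch against a static, simplified stand-in for the table markup (the real etherscan row layout may differ); note that newer soupsieve releases, the selector engine behind bs4, spell :contains as :-soup-contains:

```python
from bs4 import BeautifulSoup

# Simplified, assumed stand-in for two rows of the etherscan txs table
html = """
<table><tbody>
<tr>
  <td>0x48a97150373ca517723db6c39eebcda34719e73a9adb975d5912f21c5a9b4971</td>
  <td>8086187</td><td>1 min ago</td><td>0xabc</td><td>IN</td>
  <td><a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234"
         title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a></td>
</tr>
<tr>
  <td>0xdeadbeef</td><td>8086187</td><td>2 min ago</td>
  <td>0xdef</td><td>OUT</td><td>0x123 (an ordinary transfer)</td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
# keep only rows whose 6th column mentions "Contract Creation"
rows = soup.select('tr:has(td:nth-of-type(6):-soup-contains("Contract Creation"))')
# pair each txn hash (column 1) with the contract address (title of the anchor in column 6)
pairs = [(row.select_one('td:nth-of-type(1)').text,
          row.select_one('td:nth-of-type(6) a')['title']) for row in rows]
print(pairs)
```

Only the first row matches, so pairs holds one (txn hash, contract address) tuple.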

    Additional reading:

    CSS selectors are covered extensively in the SoupSieve documentation, the selector engine used by bs4 4.7+.

    Note: support for :contains and :has requires bs4 4.7.1; nth-of-type is widely supported.