I am trying to scrape https://etherscan.io/ with the BeautifulSoup library in Python 3 for an open-source project. Specifically, I want to grab the txn address (hash) of any row whose "To" column reads "Contract Creation" (i.e., the inner HTML of the link).
Take for example the line at this link, as shown by the Inspect Element feature of Firefox:
<a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234" title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a>
Here is some code:
import requests
from bs4 import BeautifulSoup

url = "https://etherscan.io/txs?block=8086187"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
page = content.findAll('td', attrs={"span": ""})
page = ''.join(str(page))
tx = []
if page.find("Contract Creation") != -1:
    # find tx that matches with contract
    for i in range(len(page)):
        if i + 1 != len(page):
            # pseudocode for the part I am stuck on:
            # if {LINE AT CURRENT PAGE == "Contract Creation"}:
            #     tx.append(TXN address); break
            pass
For this page, expected output should be:
0x48a97150373ca517723db6c39eebcda34719e73a9adb975d5912f21c5a9b4971
I am having trouble pulling out the specific information. As of now, I am just making sure the page has a contract creation and then trying to find that. I could hardcode it and check for a line that says
if page[i] == "C" and page[i+1] == "o" and page[i+2] == "n" ...:
    txn.append(page[i-someNumber:i-anotherNumber])
but this isn't efficient.
Even better would be to get just the contract address, which is located in the title attribute. If I can grab the specific <a href> line, then I could feasibly grab the contract address: 0x65A0cDb8e79Ae3e0c54436362206fd0769335234
With bs4 4.7.1 you can use nth-of-type and :contains to search the 6th column for that string. Then use :has to get the parent row and again nth-of-type to get the first column value associated with that row, i.e. the txn. The url has query string params so you can pull back more results at a time. You can use Session for the efficiency of re-using the connection.
The idea is to show the components and a framework for matching and extracting. You could be looping a list of urls instead for example.
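As a rough illustration of looping over a list of urls, here is a minimal sketch that builds the page URLs up front using only the standard library. The helper name build_page_urls is made up for this example, and the meaning of the query parameters (ps as rows per page, p as page number) is an assumption inferred from the URL used in the Python code in this answer, not something confirmed by the site's documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://etherscan.io/txs"


def build_page_urls(first_page, last_page, page_size=51):
    """Return one listing URL per page, first_page..last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        # Assumption: `ps` is rows per page and `p` is the page number,
        # inferred from the URL in the answer's code.
        query = urlencode({"ps": page_size, "p": page})
        urls.append("{}?{}".format(BASE_URL, query))
    return urls


urls = build_page_urls(1, 3)
print(urls)
```

Each URL can then be passed to s.get(...) inside the Session loop in place of the formatted string.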
CSS selectors:
tr:has(td:nth-of-type(6):contains("Contract Creation")) td:nth-of-type(1)
[Figure: diagram explaining the selector combination]
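To see how the selector pieces combine without hitting the live site, here is a minimal sketch run against a made-up two-row table. The sample HTML and its placeholder values are assumptions for illustration only; the one thing it mirrors from the real page is the column order described in this answer (column 1 is the txn, column 6 is "To"). Newer Soup Sieve releases may print a deprecation warning for :contains.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the transactions table (not real page markup).
# Column 1 holds the txn hash and column 6 holds the "To" value.
SAMPLE_HTML = """
<table>
  <tr><td>0xaaa</td><td>1</td><td>now</td><td>0xfrom1</td><td>IN</td>
      <td><a href="/address/0xc1" title="0xc1">Contract Creation</a></td></tr>
  <tr><td>0xbbb</td><td>1</td><td>now</td><td>0xfrom2</td><td>IN</td>
      <td><a href="/address/0xd2">0xd2</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: 6th-column cells whose text contains the string.
to_cells = soup.select('td:nth-of-type(6):contains("Contract Creation")')

# Step 2: rows that have such a cell somewhere inside them.
rows = soup.select('tr:has(td:nth-of-type(6):contains("Contract Creation"))')

# Step 3: the first-column cell of those rows (descendant combinator).
hashes = [
    td.text
    for td in soup.select(
        'tr:has(td:nth-of-type(6):contains("Contract Creation")) td:nth-of-type(1)'
    )
]
print(len(to_cells), len(rows), hashes)
```

Building the selector up in steps like this makes it easier to spot which part stops matching if the site changes its column layout.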
Python3:
from bs4 import BeautifulSoup as bs
import requests

results = []

with requests.Session() as s:
    for page in range(1, 10):
        r = s.get('https://etherscan.io/txs?ps=51&p={}'.format(page))
        soup = bs(r.content, 'lxml')
        txns_current = [item.text for item in soup.select('tr:has(td:nth-of-type(6):contains("Contract Creation")) td:nth-of-type(1)')]
        if txns_current:
            results.append(txns_current)

final = [item for i in results for item in i]
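The question also asks about reading the contract address from the title attribute of the link. Here is a minimal sketch of one way to do that. It uses find_all with an exact string match, which is a different technique from the CSS selector above. The anchor markup is copied from the question; the surrounding table and the second link are made up for illustration.

```python
from bs4 import BeautifulSoup

# The first anchor is copied from the question; the wrapper table and the
# second anchor are made-up filler for illustration.
SAMPLE_HTML = """
<table><tr>
  <td><a href="/address/0x65a0cdb8e79ae3e0c54436362206fd0769335234" title="0x65a0cdb8e79ae3e0c54436362206fd0769335234">Contract Creation</a></td>
  <td><a href="/address/0x0000000000000000000000000000000000000001">some other link</a></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find <a> tags whose text is exactly "Contract Creation", then read the
# `title` attribute, skipping any anchor that lacks one.
contract_addresses = [
    a["title"]
    for a in soup.find_all("a", string="Contract Creation")
    if a.has_attr("title")
]
print(contract_addresses)
```

Note that the title attribute holds the lowercase form of the address, so it will differ in letter case from a checksummed address.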
Additional reading:
CSS selectors are covered extensively here:
Note: support for :contains and :has is with bs4 4.7.1. nth-of-type is widely supported.
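As far as I know, newer releases of Soup Sieve (the selector library behind select in bs4), from about version 2.1 onward, prefer the spelling :-soup-contains and emit a deprecation warning for plain :contains. Treat that version number as an assumption and check your installed release. A minimal sketch of the newer spelling on a made-up two-column table:

```python
from bs4 import BeautifulSoup

# Toy two-column table (made up), so the "To" cell is column 2 here
# rather than column 6 as on the real page.
html = (
    "<table>"
    "<tr><td>0xaaa</td><td>Contract Creation</td></tr>"
    "<tr><td>0xbbb</td><td>0xd2</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# Same pattern as the selector in this answer, with the newer pseudo-class name.
selector = (
    'tr:has(td:nth-of-type(2):-soup-contains("Contract Creation")) '
    "td:nth-of-type(1)"
)
matches = [td.get_text(strip=True) for td in soup.select(selector)]
print(matches)
```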
You can practice selectors here: