I managed to extract what I wanted with the snippet below; however, I think it's problematic. I need help returning the entire string based on the partial match.
import requests

url = "https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053#code"
queries = ["twitter", "www.", "https://t.me"]
r = requests.get(url)
for q in queries:
    q = q.lower()
    if q in r.text.lower():
        if q.startswith(tuple(queries)):
            print("Found ", q)
    else:
        print("Not Found ", q)
Current Output:
Found www.
Found https://t.me
Wanted Output: #-- return the whole string
Found - www.shibuttinu.com
Found - https://t.me/Shibuttinu
Not Found - twitter
You could build a regular expression from your given queries. The following example assumes each whole string is preceded by a quote or a space and terminated by a quote or a newline (which might not always be the case).
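import requests
import re

url = "https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053#code"
r = requests.get(url)
queries = ["twitter", "www.", "https://t.me"]
# Escape each query so characters such as "." are matched literally
re_queries = '|'.join(re.escape(q) for q in queries)
# Characters allowed on either side of a query (raw string avoids an invalid "\-" escape)
valid_url = r"[a-z0-9:/?\-=&.]"
# Opening quote or space, the whole string (group 1), then a closing quote or newline
re_query = rf"['\" ]({valid_url}*?({re_queries}){valid_url}*?)['\"\n]"
for match in re.finditer(re_query, r.text, re.I):
    print(match.group(1))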
This would return whole strings as:
twitter:card
twitter:title
twitter:description
twitter:site
twitter:image
https://www.googletagmanager.com/gtag/js?id=UA-46998878-23
www.shibuttinu.com
https://t.me/shibuttinu
https://www.binance.org/en/smartChain
https://twitter.com/BscScan
Twitter
What this does is locate all of your queries, together with any run of valid URL characters on either side, but only if the whole string is preceded by a quote or a space and followed by a quote or a newline. The regular expression syntax allows these restrictions to be defined. The re.I flag makes the match case-insensitive (removing the need to lowercase the text).
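To get the per-query "Found" / "Not Found" report asked for in the question, one option is to apply the same pattern once per query and collect the matches. The following is only a minimal sketch under stated assumptions: `find_whole_strings` and `sample_html` are hypothetical names introduced here, and the sample text stands in for `r.text` so the example runs without a network request.

```python
import re


def find_whole_strings(text, queries):
    """Map each query to the list of whole strings in `text` containing it."""
    # Same character class and delimiters as in the answer above
    valid_url = r"[a-z0-9:/?\-=&.]"
    results = {}
    for q in queries:
        pattern = rf"['\" ]({valid_url}*?{re.escape(q)}{valid_url}*?)['\"\n]"
        results[q] = [m.group(1) for m in re.finditer(pattern, text, re.I)]
    return results


# Hypothetical sample standing in for r.text (not the real page content)
sample_html = (
    '<a href="https://t.me/Shibuttinu">Telegram</a>\n'
    "<p>Website: www.shibuttinu.com\n</p>"
)

queries = ["twitter", "www.", "https://t.me"]
for q, hits in find_whole_strings(sample_html, queries).items():
    if hits:
        for hit in hits:
            print("Found -", hit)
    else:
        print("Not Found -", q)
```

With this sample text, the loop reports "Not Found - twitter" and then the two whole strings for the other queries. On the real page, "twitter" would also match the meta tags shown in the output above, so the character class or delimiters may need tightening if only links are wanted.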