Tags: python, python-3.x, beautifulsoup, webrequest

Display the entire string if a partial match is found in a webpage using Python and BeautifulSoup


I managed to extract what I wanted with the snippet below; however, I think it's problematic. I need help returning the entire string based on a partial match.

import requests
url = "https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053#code"
queries = ["twitter", "www.", "https://t.me"]

r = requests.get(url)
for q in queries:
    q = q.lower()
    if q in r.text.lower():
        if q.startswith(tuple(queries)):
            print("Found ", q)
        else:
            print("Not Found ", q)

Current Output:

Found  www.
Found  https://t.me

Wanted Output: #-- return the whole string

Found - www.shibuttinu.com
Found - https://t.me/Shibuttinu
Not Found - twitter

Solution

  • You could build a regular expression from your queries. The following example assumes each whole string is preceded by a quote or a space and followed by a quote or a newline (which might not always be the case).

    import requests
    import re
    
    url = "https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053#code"
    r = requests.get(url)
    
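    queries = ["twitter", "www.", "https://t.me"]
    # Escape each query so characters such as "." are matched literally
    re_queries = '|'.join(re.escape(q) for q in queries)
    # Characters allowed inside a whole string (raw string keeps "\-" intact)
    valid_url = r"[a-z0-9:/?\-=&.]"
    # Opening quote or space, the whole string (group 1), then a closing quote or newline
    re_query = rf"['\" ]({valid_url}*?({re_queries}){valid_url}*?)['\"\n]"
    
    for match in re.finditer(re_query, r.text, re.I):
        print(match.group(1))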
    

    This would return whole strings as:

    twitter:card
    twitter:title
    twitter:description
    twitter:site
    twitter:image
    https://www.googletagmanager.com/gtag/js?id=UA-46998878-23
    www.shibuttinu.com
    https://t.me/shibuttinu
    https://www.binance.org/en/smartChain
    https://twitter.com/BscScan
    Twitter
    

    The regular expression locates each of your queries together with any valid URL characters immediately before and after it, but only where the whole string is preceded by a quote or a space and followed by a quote or a newline. The re.I flag makes the match case insensitive, which removes the need to lowercase the text.
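
The regular expression approach above prints every match in one pass, so it cannot report a query that was not found. A minimal sketch of one way to get the "Found" / "Not Found" lines from the question is to run one pattern per query and collect the matches. The function name find_whole_strings and the sample_text below are hypothetical and only stand in for r.text, so the sketch runs without a network request; the delimiter assumptions are the same as in the answer.

```python
import re

# Same character class as in the answer: characters allowed inside a whole string
VALID_URL_CHARS = r"[a-z0-9:/?\-=&.]"


def find_whole_strings(text, queries):
    """Map each query to the list of whole delimited strings that contain it."""
    results = {}
    for q in queries:
        # Opening quote or space, whole string (group 1), closing quote or newline
        pattern = rf"['\" ]({VALID_URL_CHARS}*?{re.escape(q)}{VALID_URL_CHARS}*?)['\"\n]"
        results[q] = [m.group(1) for m in re.finditer(pattern, text, re.I)]
    return results


# Hypothetical stand-in for r.text (assumption: not the real page content)
sample_text = (
    "Website: www.shibuttinu.com\n"
    "Telegram: https://t.me/Shibuttinu\n"
)

queries = ["twitter", "www.", "https://t.me"]
found = find_whole_strings(sample_text, queries)

for q in queries:
    if found[q]:
        for whole in found[q]:
            print("Found -", whole)
    else:
        print("Not Found -", q)
```

To use it on the real page, pass r.text instead of sample_text. Keeping one pattern per query costs a few extra scans of the text, but it makes it easy to tell which queries produced no match.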