python, beautifulsoup, python-requests, urllib

Scraping a specific GTAG value from a website


I am trying to scrape a website and return its GTM container ID. I found a solution, but it only works for one specific website.

It works for https://www.observepoint.com/:

import urllib3
import re
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
response = http.request('GET', "https://www.observepoint.com/")
soup = BeautifulSoup(response.data,"html.parser")
GTM = soup.head.findAll(text=re.compile(r'GTM'))
print(re.search("GTM-[A-Z0-9]{6,7}",str(GTM))[0])

But when I try it on another website, for example https://www.dccomics.com/characters/superman%26sa%3DU%26ved%3D2ahUKEwi55uyMxfHxAhXMp5UCHTkMBekQFjAzegQIARAB%26usg%3DAOvVaw2PgfF7ZT6S6UeZpFImsXDC%2Cdccomics

it doesn't work (the search returns a None object), even though the GTM ID still exists and sits in the same or a similar iframe tag as on the previous website.

GTM value for the working website: [image]

GTM value for the website the script isn't working on: [image]


Solution

  • import requests
    import re
    
    urls = [
        "https://www.observepoint.com/",
        "https://www.dccomics.com/characters/superman%26sa%3DU%26ved%3D2ahUKEwi55uyMxfHxAhXMp5UCHTkMBekQFjAzegQIARAB%26usg%3DAOvVaw2PgfF7ZT6S6UeZpFImsXDC%2Cdccomics",
    ]
    
    
    def main(urls):
        # Search the raw HTML of each page instead of only the parsed <head>.
        for url in urls:
            r = requests.get(url)
            # Collect every GTM container ID that appears anywhere in the response.
            match = re.findall(r"GTM-[A-Z0-9]{6,7}", r.text)
            if match:
                # A set drops the duplicates when the ID appears more than once.
                print(set(match))
    
    
    main(urls)
    

    Output:

    {'GTM-5LS3NZ'}
    {'GTM-538C4X'}
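
The extraction step above can be separated from the network request so it is easy to try on any HTML string. The following is only a minimal sketch: the function name extract_gtm_ids, the sample markup, and the ID GTM-ABC1234 are made up for illustration. It searches the whole HTML string, so it also finds an ID that appears only in an attribute value such as an iframe src inside the body. A BeautifulSoup text search limited to soup.head looks only at text nodes in the head, so it would not see such an ID.

```python
import re

# Pattern from the answer above: "GTM-" followed by 6 or 7 uppercase letters or digits.
GTM_PATTERN = re.compile(r"GTM-[A-Z0-9]{6,7}")


def extract_gtm_ids(html):
    """Return the sorted, de-duplicated GTM container IDs found anywhere in html."""
    return sorted(set(GTM_PATTERN.findall(html)))


# Hypothetical page: the ID appears only in an iframe src attribute inside <body>.
sample_html = """
<html>
  <head><title>Example</title></head>
  <body>
    <noscript>
      <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-ABC1234"></iframe>
    </noscript>
  </body>
</html>
"""

print(extract_gtm_ids(sample_html))      # ['GTM-ABC1234']
print(extract_gtm_ids("<html></html>"))  # []
```

To use it on a live page, pass requests.get(url).text to extract_gtm_ids. An empty list means no ID matched, which avoids indexing into a None result.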