Tags: python, beautifulsoup, urlparse

How to extract a URL from an HTML page in Python?


Using this Python code,

...
resp = logout_session.get(logout_url, headers=headers, verify=False, allow_redirects=False)
soup = BeautifulSoup(resp.content, "html.parser")
print(soup.prettify())

I was able to make an API call, and the response content looks like this:

<!DOCTYPE html>
<html>
 <head>...</head>
 <body>
  <div class="container">
   <div class="title logo" id="header">
    <img alt="" id="business-logo-login" src="/customviews/image/business_logo:f0a067275aba3c71c62cffa2f50ac69c/"/>
   </div>
   <div class="input-group alert alert-success text-center" id="title" role="alert">
    Successfully signed out
   </div>
   <div class="input-group alert text-center">
    <a href="/saml-idp/portal/">
     Login again
    </a>
   </div>
   <div>
    <p>
     You will be redirected to https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/ after 5 seconds ...
    </p>
    <script language="javascript" nonce="">
     window.onload = window.setTimeout(function() {
    window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
    </script>
   </div>
  </div>
 </body>
</html>

Now I want to extract this URL:

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU 

from this content. Does anyone know how to do this in Python?


Solution

  • Try matching the argument of window.location.replace() with a regular expression:

    import re
    
    # resp = requests.get(...)
    
    url = re.search(r'window\.location\.replace\("([^"]+)', resp.text).group(1)
    print(url)
    

    Prints:

    https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU
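
A slightly more defensive variation on the regex approach above, as a minimal sketch: re.search returns None when the page contains no redirect, so calling .group(1) directly would raise AttributeError in that case. The helper name extract_redirect_url and the sample_html string are made up for illustration (the sample is a trimmed copy of the script block from the question), and the pattern also accepts single quotes, which is an assumption about how the page might vary. Once the URL is extracted, urllib.parse can split it into host, path, and query parameters.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Matches the first argument of window.location.replace("...") or ('...').
REDIRECT_RE = re.compile(r"""window\.location\.replace\(\s*(["'])(.+?)\1""")


def extract_redirect_url(html: str) -> Optional[str]:
    """Return the URL passed to window.location.replace(), or None if absent."""
    match = REDIRECT_RE.search(html)
    return match.group(2) if match else None


# Hypothetical stand-in for resp.text: the script block from the question.
sample_html = """
<script language="javascript" nonce="">
 window.onload = window.setTimeout(function() {
window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
</script>
"""

url = extract_redirect_url(sample_html)
if url is None:
    print("no redirect found")
else:
    parts = urlsplit(url)            # scheme, netloc, path, query, fragment
    params = parse_qs(parts.query)   # maps each query key to a list of values
    print(parts.netloc)
    print(parts.path)
    print(params["SAMLResponse"][0])
```

With a real response, pass resp.text instead of sample_html. Note that parse_qs URL-decodes values (for example, it turns "+" into a space), so if the raw, still-encoded SAMLResponse is needed, it may be safer to work with parts.query directly.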