Tags: python, web-scraping, beautifulsoup, url-rewriting, href

How do I replace all URLs in HTML with their final redirect?


I'd prefer to use BeautifulSoup, since I'm already using it for other purposes, but any Python solution is fine.

    s = BeautifulSoup(bodyhtml, features="lxml")
    items = s.find_all("div", {"class": "text-block"})
    # I want to replace all URLs in `items` with their final redirect.

Here is a sample URL:

https://tracking.tldrnewsletter.com/CL0/https:%2F%2Farstechnica.com%2Finformation-technology%2F2020%2F04%2Fmeet-dark_nexus-quite-possibly-the-most-potent-iot-botnet-ever%2F/1/0100017163ab9f84-cfdbd3c3-ef8c-4b34-b2a0-f6f4b8f78359-000000/BEB0JUmMqamX4piPthkn_oJ78cjvd6UocEmGf7iO5Pk=136

Here is items[5] (all items look alike):

<div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a><br/><br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span><br/></span><br/></div>

Solution

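  • Select the relevant a elements. The destination URL is embedded, percent-encoded, in each tracking link, so no network request is needed to recover it. Strip the tracking prefix from the href attribute (this assumes every link shares the same prefix), discard everything from the next / onward, and then percent-decode what remains: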

    from bs4 import BeautifulSoup
    from urllib.parse import unquote
    
    
    html = """
    <head>
    
        <body>
            <p>
                <div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://tracking.tldrnewsletter.com/CL0/https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2Fgoogle-stadia-free-pro-subscription/1/010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
                    <br/>
                    <br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
                    <br/>
                    </span>
                    <br/>
                </div>
            </p>
    
            </body>
    </head>
    """
    
    s = BeautifulSoup(html, features="lxml")
    prefix = "https://tracking.tldrnewsletter.com/CL0/"
    for a in s.select('div.text-block a'):
        # Drop the tracking prefix, keep the first path segment, then percent-decode it.
        a['href'] = unquote(a['href'].replace(prefix, "").split('/')[0])
    print(s)
    

    Outputs:

    <html><head>
    </head><body>
    <p>
    </p><div class="text-block"><span style="color: rgb(0, 0, 0);"><a href="https://www.polygon.com/2020/4/8/21213551/google-stadia-free-pro-subscription"><span style="font-size: 14px;"><strong>Google Stadia now free to anyone with a Gmail address (2 minute read)</strong></span></a>
    <br/>
    <br/><span style='font-size: 14px; font-family: "Helvetica Neue", Helvetica, Arial, Verdana, sans-serif;'>Google Stadia is now free to anyone with a Gmail address. New users will receive two months of Stadia Pro for free. Existing Stadia Pro users won't be charged for the next two months. Nine games are included with the offer. Access to Stadia previously required purchasing a Google Stadia Premier Edition bundle for $129. Stadia Pro will cost $9.99 a month after the two-month trial period ends. Users can cancel their subscriptions online at any time.</span>
    <br/>
    </span>
    <br/>
    </div>
    </body>
    </html>
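
The approach above hardcodes the tracking prefix. As a rough sketch of a variant that does not, the helper below scans the path segments of a link for one that decodes to an http(s) URL. The names embedded_target and follow_redirects are hypothetical, not part of BeautifulSoup or the answer above, and the sketch assumes the destination always occupies a single percent-encoded path segment, as in the sample links. For links that do not embed their destination, follow_redirects shows one possible fallback using only the standard library; it performs a real GET request, so it needs network access and is slower.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def embedded_target(tracking_url: str) -> Optional[str]:
    """Return the URL embedded in the path of tracking_url, or None.

    Assumption: the destination is one percent-encoded path segment,
    e.g. .../CL0/https:%2F%2Fexample.com%2Fpage/1/...
    """
    path = urlsplit(tracking_url).path
    for segment in path.split("/"):
        decoded = unquote(segment)
        if decoded.startswith(("http://", "https://")):
            return decoded
    return None


def follow_redirects(url: str, timeout: float = 10.0) -> str:
    """Fallback sketch: let urllib follow redirects and report the final URL.

    This issues a GET request over the network.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


if __name__ == "__main__":
    sample = (
        "https://tracking.tldrnewsletter.com/CL0/"
        "https:%2F%2Fwww.polygon.com%2F2020%2F4%2F8%2F21213551%2F"
        "google-stadia-free-pro-subscription/1/"
        "010001715e86638d-8bd389c9-f9eb-4b68-ade4-c2d706ea5ecb-000000/"
        "J3pqLEKSYUvxNOcq8090EHiTSXXHiZtRNM6JD1aQP8s=136"
    )
    print(embedded_target(sample))
```

Inside the loop over s.select('div.text-block a'), this could be used as a['href'] = embedded_target(a['href']) or a['href'], which leaves links without an embedded destination unchanged.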