URL encoding and unshortening


I have some links that I am collecting from a sitemap and from Twitter. The problem is that some of the links contain Arabic characters, like this one:

https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء

I am trying to unshorten the shortened Twitter links and percent-encode the Arabic links so that they look like this:

https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1

Solution

  • If your goal is to take a URL containing non-ASCII characters and convert it to the %XX (percent-encoded) format, you can use Python's built-in urllib.parse module to encode the link:

    >>> import urllib.parse
    >>> oddlink = 'https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء'
    >>> goodlink = urllib.parse.quote(oddlink)
    >>> print(goodlink)
    https%3A//www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1
    

    Keep in mind that quote() also encodes the : after https as %3A. You can manually undo this:

    >>> goodlink = goodlink.replace('%3A', ':', 1)  # restore only the first encoded colon
    >>> print(goodlink)
    https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1
    

    Alternatively, you can pass : as a 'safe' character, so that quote() leaves it unencoded:

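    >>> urllib.parse.quote(oddlink, safe='/:')
    'https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1'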
    

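    The / has to stay among the safe characters because it separates the parts of a URL; if it is left out, every slash is encoded as %2F:

    >>> urllib.parse.quote('https://www.google.com/', safe='/:')  # / is safe
    'https://www.google.com/'
    >>> urllib.parse.quote('https://www.google.com/', safe=':')  # / is not safe
    'https:%2F%2Fwww.google.com%2F'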
    

    The default is safe='/', so / is left alone unless you override safe. When you pass your own value, make sure it still includes /.
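
A more general alternative to tuning `safe` by hand is to split the URL first and percent-encode only the parts that can contain non-ASCII text. The following is a minimal sketch, not part of the original answer: `encode_url` is a hypothetical helper name, and it assumes the host name is already ASCII (an internationalized domain would need IDNA handling, which is left out here). Keeping `%` in the safe set means that links which are already encoded are not encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the path, query, and fragment of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # '%' is kept safe so existing %XX escapes are not double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    # Scheme and host are passed through unchanged (assumed to be ASCII).
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


oddlink = "https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء"
goodlink = encode_url(oddlink)
print(goodlink)
```

Because the scheme separator never passes through `quote()`, the `:` after `https` does not need to be restored afterwards, and running the helper on an already encoded link should return it unchanged.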