URL encoding and unshortening

I have some links that I am collecting from a sitemap and Twitter. The problem is that some links contain Arabic characters, like this one:

https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء

I am trying to unshorten the shortened Twitter links and percent-encode the Arabic links, so that I get links that look like this:

https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1

Solution

If your goal is to take a URL with non-ASCII characters in it and convert it to the percent-encoded %XX format, you can use urllib.parse from Python's standard library to encode the link:

>>> import urllib.parse
>>> oddlink = 'https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء'
>>> goodlink = urllib.parse.quote(oddlink)
>>> print(goodlink)
https%3A//www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1

Keep in mind that quote also encodes the : after https as %3A. You can undo this manually by replacing the three characters %3A (indices 5 to 7) with a colon:

>>> goodlink = goodlink[:5] + ':' + goodlink[8:]
>>> print(goodlink)
https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1

Or, you can pass : as a 'safe' character, which tells urllib.parse.quote to leave it unencoded:

>>> urllib.parse.quote(oddlink, safe='/:')

The / is kept among the safe characters because it separates the parts of a URL, and encoding it breaks the link:

https://www.google.com/ # with / in safe
https:%2F%2Fwww.google.com%2F # without / in safe
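
The two forms above can be reproduced directly. This is just a small check using urllib.parse.quote with different safe arguments; the variable names are only for illustration:

```python
from urllib.parse import quote

url = "https://www.google.com/"

# Default: only '/' is safe, so ':' becomes %3A.
default_quoted = quote(url)

# Both '/' and ':' are safe, so the URL is left unchanged.
with_slash = quote(url, safe="/:")

# Only ':' is safe, so every '/' becomes %2F.
without_slash = quote(url, safe=":")

print(default_quoted)
print(with_slash)
print(without_slash)
```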

The / character is the default value of the safe argument, but passing your own value replaces that default rather than adding to it, so you need to include / yourself.
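
If you would rather not think about which characters to mark as safe for the whole string, another option is to split the URL first and quote only the parts that can contain Arabic text. The following is a minimal sketch, not a definitive implementation. The names encode_url, follow_redirects and unshorten are my own, and it assumes the host name is plain ASCII (internationalized domain names are not handled). For the unshortening part of the question, it assumes the shortener answers with ordinary HTTP redirects, which urllib.request.urlopen follows on its own.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import urlopen


def encode_url(url):
    """Percent-encode path, query and fragment; leave scheme and host alone."""
    parts = urlsplit(url)
    # '%' is kept safe so an already-encoded URL is not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    # Assumption: parts.netloc is ASCII, so it is passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


def follow_redirects(url, timeout=10):
    """Return the URL reached after urlopen has followed any HTTP redirects."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def unshorten(url, resolve=follow_redirects):
    """Resolve a (possibly shortened) link, then percent-encode the result."""
    # The URL is encoded before the request because the HTTP client
    # expects an ASCII request line.
    return encode_url(resolve(encode_url(url)))
```

For the link in the question, encode_url(oddlink) gives the same result as urllib.parse.quote(oddlink, safe='/:'). The resolve parameter is only there so the network call can be swapped out, for example when testing offline. Note that this only follows HTTP redirects; a shortener that redirects through JavaScript or an HTML meta refresh would need a different approach.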